Duplicate Detection for Quality Assurance of Document Image Collections

Abstract

Digital preservation workflows for image collections that rely on automatic and semi-automatic image acquisition and processing are prone to quality loss. We present a computer-vision method for quality assurance of scanned content. A visual dictionary derived from local image descriptors enables efficient perceptual image fingerprinting, which is used to compare scanned book pages and detect duplicated pages. A spatial verification step based on descriptor matching makes the approach more robust. Results are presented for a digitized book collection of approximately 35,000 pages. Duplicated pages are identified with high reliability, in close agreement with results obtained independently by human visual inspection.
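
The fingerprinting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that local descriptors have already been extracted for each page and that a visual dictionary (a list of "visual word" vectors, for instance cluster centres) is already available. The function names, the cosine similarity measure, and the 0.9 threshold are hypothetical choices made for illustration only. The spatial verification step is not shown; it would be applied afterwards to the candidate pairs returned here.

```python
import math
from collections import Counter


def nearest_word(descriptor, dictionary):
    """Index of the visual word closest to the descriptor (squared Euclidean distance)."""
    return min(
        range(len(dictionary)),
        key=lambda i: sum((d - w) ** 2 for d, w in zip(descriptor, dictionary[i])),
    )


def fingerprint(descriptors, dictionary):
    """L2-normalized histogram of visual-word occurrences for one page."""
    counts = Counter(nearest_word(d, dictionary) for d in descriptors)
    hist = [counts.get(i, 0) for i in range(len(dictionary))]
    norm = math.sqrt(sum(h * h for h in hist))
    return [h / norm for h in hist] if norm else hist


def similarity(fp_a, fp_b):
    """Cosine similarity of two L2-normalized fingerprints."""
    return sum(a * b for a, b in zip(fp_a, fp_b))


def duplicate_candidates(fingerprints, threshold=0.9):
    """Index pairs whose similarity reaches the (assumed) threshold.

    Compares all pairs, which is quadratic in the number of pages and
    only meant for illustration.
    """
    pairs = []
    for i in range(len(fingerprints)):
        for j in range(i + 1, len(fingerprints)):
            if similarity(fingerprints[i], fingerprints[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Toy data: 2-D "descriptors" and a 3-word dictionary (hypothetical values).
dictionary = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
page_a = [[0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
page_b = [[0.0, 0.1], [1.0, 1.1], [0.9, 0.9]]  # near-copy of page_a
page_c = [[0.0, 0.9], [0.1, 1.0], [0.0, 0.1]]  # different content

fingerprints = [fingerprint(p, dictionary) for p in (page_a, page_b, page_c)]
candidates = duplicate_candidates(fingerprints)
```

In this toy example only the first two pages are reported as a candidate duplicate pair. Real descriptors have far more dimensions, and a collection of tens of thousands of pages would need an index rather than exhaustive pairwise comparison.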

Details

Creators
Reinhold Huber-Mörk; Alexander Schindler; Sven Schlarb
Institutions
Date
Keywords
ischool; toronto; canada; digital preservation; information retrieval; image processing
Publication Type
paper
License
CC BY-NC-SA 3.0 AT
Download
1772262 bytes
