PDF Mayhem: Is Broken Really Broken?

Abstract

In this paper, we focus on the quality of PDF files. We are interested in the errors that validators report during validation: how accurate are these errors, and can we build easy workarounds to avoid or even fix the underlying problems? We present findings from a pilot experiment in which we validated more than 200,000 PDF files from well-known corpora with different validators and found several thousand problematic files. We then devised a process for reconstructing the invalid files and analyzing the converted data. Our results show that there are potentially workable methods for avoiding problems during PDF validation, and that these methods can significantly reduce the workload of preservation specialists who are responsible for the quality of the data. Our longer-term aim is to master and manage PDF validation so that we can build an automated workflow able to migrate most PDF files to PDF/A during ingest into a digital preservation repository. To achieve this in a reliable manner, further studies are needed to build on what we have presented here.

Details

Creators
Juha Lehtonen; Heikki Helin; Johan Kylander; Kimmo Koivunen
Keywords
boston
Publication Type
paper
License
CC BY 4.0 International
