A QUESTION OF CHARACTER: How do we automatically recharacterize data at cloud scales?

Abstract

Many of the preservation actions we undertake on digital content are driven by the format of that content. Format information is often determined at the point of ingest and is not regularly updated as our knowledge of file formats improves. Periodically re-characterizing all content in a repository would yield more accurate identifications over time, but a more sustainable approach is to re-characterize only the content whose identification is actually likely to have changed. Preservica’s new Automated Active Digital Preservation feature seeks to do exactly this, but even when considering only subsets of the data in our cloud systems, we face significant challenges of scale. In this paper, we describe those challenges, the approach we have taken to implement the feature, and the testing we have performed to verify the viability of this approach.

Details

Creators
Jack O’Sullivan; David Clipsham; Divyesh Soni; Richard Smith; Jonathan Tilbury
Institutions
Date
Keywords
scalability; automation; characterization; preservation actions
Publication Type
paper
License
CC-BY 4.0 International