Migrating data without original checksums

Abstract

The National Library (NLN) has used SAM-FS (Oracle HSM) as a bit-repository since 2007. SAM-FS will soon reach end of life (EOL). The NLN has recently developed a front-end software solution called DPS (Digital Preservation Services), which uses HPSS from IBM as its underlying bit-repository. DPS/HPSS is intended to replace SAM-FS as the preservation solution for digital objects.

DPS requires that every object be delivered with an associated checksum. In the old SAM-FS bit-repository, many objects lack checksums, especially material from the first years of its use. All objects in SAM-FS are stored in three instances. If the three instances were found to differ, there would be no checksums to determine which instance is correct. The total amount of data to be migrated from SAM-FS to DPS is approximately 14 petabytes, and an estimated one third of this data lacks checksums.

**Challenge:** How could we ensure that objects migrated from SAM-FS to DPS are the same as those originally archived in SAM-FS when original checksums do not exist?

**How we solved it:** Because SAM-FS holds multiple instances of each preserved object on different media, we created a workflow that migrates objects from SAM-FS to DPS and uses checksums pre-generated from the tape instance to verify the disk instance. This allowed the migrated data to be considered "authentic" relative to the time it was submitted to the SAM-FS bit-repository.

**Result:** So far, we have re-archived 3 petabytes. In one specific case, the pre-generated checksums prevented data loss during the migration. A number of files read from the disk copy in SAM-FS had checksums that did not match. It turned out that one of the archive disk systems had a corrupt file system after an unexpected shutdown. Because the checksums had been created and stored before the migration started, we quickly discovered that something was wrong with many of the files to be migrated. Had we not had these pre-generated checksums, and instead simply read the disk copy from SAM-FS, generated a checksum from it, and sent the object to DPS, data would have been lost.
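
The verification step described above can be illustrated with a minimal sketch. It is not the NLN workflow itself: the function names, the directory layout (one tape instance and one disk instance mirrored under two roots), and the choice of SHA-256 are assumptions made for illustration only, since the abstract does not name the algorithm or the tooling used.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Stream a file through a hash function and return the hex digest.

    SHA-256 is an assumption; the abstract does not name the algorithm.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pregenerate_manifest(tape_root):
    """Map each object's relative path to the checksum of its tape instance.

    Hypothetical helper: assumes the tape instance is readable as a directory tree.
    """
    tape_root = Path(tape_root)
    return {
        p.relative_to(tape_root).as_posix(): file_checksum(p)
        for p in sorted(tape_root.rglob("*"))
        if p.is_file()
    }


def verify_disk_instance(disk_root, manifest):
    """Compare the disk instance against the pre-generated manifest.

    Returns (verified, mismatched): objects whose disk copy matches the
    tape checksum, and objects that are missing or differ on disk.
    """
    disk_root = Path(disk_root)
    verified, mismatched = {}, []
    for rel_path, expected in manifest.items():
        candidate = disk_root / rel_path
        if candidate.is_file() and file_checksum(candidate) == expected:
            verified[rel_path] = expected
        else:
            mismatched.append(rel_path)
    return verified, mismatched
```

In this sketch, only the objects in `verified` would be submitted to DPS together with their checksums, while anything in `mismatched` would be held back for investigation. The important design point is that the manifest is built from an independent instance before migration starts, so a corrupted disk copy cannot silently supply its own "correct" checksum.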

Details

Creators
Thomas Edvardsen; Trond Teigen
Institutions
Date
2024-09-19 13:35:00 +0100
Keywords
information technology for dp; start 2 preserve
Publication Type
lightning talk
License
Creative Commons Attribution 4.0 (CC-BY-4.0)