Integrating Preservation Service into a multi-repository environment at CERN

Abstract

In this lightning talk, we present the most challenging aspects of preserving the highly diverse data of the European Organization for Nuclear Research (CERN), including scientific datasets, multimedia items, and minutes of internal meetings. The presentation focuses on the complexity of the workflow, its extensibility and scalability, and the open issue of data monitoring. As part of the Digital Memory project, a platform was built so that information repositories can benefit from a centralized Preservation Service in which system managers can initiate preservation pipelines and view the state of already archived records. Once a protocol has been agreed on, the platform can harvest data from the different repositories, transform it for long-term preservation, and store it on CERN Cloud storage (EOS). The preservation process comprises several steps: validation of Submission Information Packages (SIPs), normalization and other microservices (through Archivematica), storage on magnetic tapes, and publication in a central registry. The initial focus of the project was the ability to harvest different sources of data; to achieve that, we designed an extensible architecture. Since then, we have adapted it to numerous repositories, including InvenioRDM instances, CodiMD, GitLab, and Indico. Another important aspect was scalability: depending on the collection, the workload can be significant, so the platform consists mostly of asynchronous processes that can run in parallel. To help with resilience, records can be filtered by executed steps and status. Our Archivematica instance is deployed on OpenShift, where the number of worker clients and the available resources can be increased on demand. Monitoring and appraisal are still open issues that we are working on. Monitoring already harvested records means retriggering the preservation pipeline when the data has changed significantly enough to justify reprocessing.
So far, the platform and registry are designed to support multiple versions of an Archival Information Package (AIP), but we intend to optimize storage by minimizing redundancy. Finally, to evaluate whether submitted data falls within CERN's official preservation scope, we plan to involve internal expert users in the process, aided by a trained model.

[Figure: CERN Preserve workflow]
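The extensible, asynchronous design described in the abstract can be illustrated with a minimal sketch. It is not the platform's actual code: the `Record` class, the `HARVESTERS` registry, the step names, and the stub harvesters are all hypothetical, chosen only to show how per-repository harvesters might plug into a shared pipeline, how records could be processed in parallel, and how they could then be filtered by status or executed steps.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Record:
    """Hypothetical bookkeeping entry for one harvested record."""
    source: str
    record_id: str
    steps_done: List[str] = field(default_factory=list)
    status: str = "pending"


# Hypothetical registry: one harvester function per source repository.
HARVESTERS: Dict[str, Callable[[str], dict]] = {}

# Illustrative step names loosely following the abstract; not the real ones.
STEPS = ["harvest", "validate", "archivematica", "tape_storage", "publish"]


def register(source: str):
    """Decorator that adds a harvester to the registry under a source name."""
    def decorator(func: Callable[[str], dict]) -> Callable[[str], dict]:
        HARVESTERS[source] = func
        return func
    return decorator


@register("indico")
def harvest_indico(record_id: str) -> dict:
    # Stub: a real harvester would call the repository's API here.
    return {"id": record_id, "origin": "indico"}


@register("gitlab")
def harvest_gitlab(record_id: str) -> dict:
    # Stub: a real harvester would call the repository's API here.
    return {"id": record_id, "origin": "gitlab"}


async def run_pipeline(record: Record) -> Record:
    """Run every step for one record, noting which steps completed."""
    harvester = HARVESTERS.get(record.source)
    if harvester is None:
        record.status = "failed"  # no harvester registered for this source
        return record
    payload = harvester(record.record_id)
    if payload.get("id") != record.record_id:
        record.status = "failed"
        return record
    record.steps_done.append("harvest")
    for step in STEPS[1:]:
        await asyncio.sleep(0)  # placeholder for real asynchronous work
        record.steps_done.append(step)
    record.status = "completed"
    return record


async def run_all(records: List[Record]) -> List[Record]:
    """Process all records concurrently; results keep the input order."""
    return list(await asyncio.gather(*(run_pipeline(r) for r in records)))


def filter_records(records: List[Record],
                   status: Optional[str] = None,
                   missing_step: Optional[str] = None) -> List[Record]:
    """Select records by status and/or by a step that has not run yet."""
    selected = []
    for r in records:
        if status is not None and r.status != status:
            continue
        if missing_step is not None and missing_step in r.steps_done:
            continue
        selected.append(r)
    return selected


if __name__ == "__main__":
    batch = [Record("indico", "1"), Record("gitlab", "2"), Record("unknown", "3")]
    done = asyncio.run(run_all(batch))
    for rec in filter_records(done, status="failed"):
        print(rec.source, rec.record_id, rec.steps_done)
```

In this sketch, supporting a new repository only requires registering one more harvester function, and records that failed or stopped before a given step can be selected for a retry without touching those that already completed.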

Details

Creators: Jean-Yves Le Meur; Panna Liptak
Date: 2024-09-19 11:15:00 +0100
Keywords: information technology for dp; scaling up
Publication Type: lightning talk
License: Creative Commons Attribution Share-Alike 4.0 (CC-BY-SA-4.0)
