From the World Wide Web to Digital Library Stacks: Preserving the French Web Archives

Abstract

The National Library of France (BnF) is mandated by French law to collect and preserve the French Internet. The project is now 10 years old, with collections ranging from 1996 to the present. To ensure their long-term preservation, the library chose to ingest these web archives into its existing digital preservation repository, SPAR (Scalable Preservation and Archiving Repository). There were numerous implementation challenges, on the modeling as well as the technical side, which the library met with solutions drawn from international collaboration and widely adopted standards whenever possible.

– Web archive-specific formats (ARC and WARC files) lacked validation and characterization tools, which led to the development of a JHOVE2 module for the ARC format.
– The heterogeneity of BnF’s web archives in terms of formats, production workflows and tools was managed by aligning all of them on a single model: the current production workflow, which uses NetarchiveSuite.
– The specificities of web archives were mapped to the PREMIS data model and dictionary and to SPAR’s global METS profile.
– Finally, the need to express technical information about ARC files in a concise, manageable fashion led us to define a format-specific metadata scheme for container files, containerMD, which will be released to the preservation community (on BnF’s website).

All this development work means new services for digital curators in general and preservation experts in particular. They will be able to know their collections better, to check their comprehensiveness, and, with that deeper understanding, to investigate new preservation strategies. The logical next step of the web archives long-term preservation project will be to allow differentiated service level agreements for specific sets of documents, with richer metadata extraction, better quality assurance and differentiated preservation strategies.

Details

Creators
Oury, Clément; Peyrard, Sébastien
Institutions
Date
Keywords
singapore; web archives; metadata; characterization tools; arc file format
Publication Type
paper
License
CC BY-SA 3.0 AT
