Deduplicating Bibliotheca Alexandrina’s Web Archive

Abstract

Archiving web content inevitably produces datasets with duplication, whether across time or across location. The Bibliotheca Alexandrina (BA) holds a legacy web archive spanning 10 years and continues to expand the collection. The BA conducted an initial assessment of this very large store of data. Given a high enough rate of duplication, deduplication would lead to sizable savings in storage requirements. The BA worked through the International Internet Preservation Consortium (IIPC) to compile best practices for recording duplicates in ISO 28500, the WARC File Format. To deduplicate legacy web archives “after the fact,” the BA is implementing the WARCrefs deduplication tools. Following implementation and testing, the BA plans to use the tools to deduplicate its one petabyte of archived web content.
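
The idea of assessing the duplication rate before deduplicating can be illustrated with a minimal sketch. It is not the WARCrefs or warcsum implementation: the function name `estimate_savings`, the use of SHA-1, and the in-memory sample payloads are all assumptions made for illustration. A real tool would read payloads from WARC records and record each duplicate as a reference to its first copy rather than merely counting it.

```python
import hashlib


def estimate_savings(payloads):
    """Return (duplicate_count, bytes_saved) for an iterable of byte payloads.

    Hypothetical helper: two payloads count as duplicates when their
    SHA-1 digests match (the digest choice is an assumption).
    """
    seen = set()
    duplicates = 0
    bytes_saved = 0
    for payload in payloads:
        digest = hashlib.sha1(payload).hexdigest()
        if digest in seen:
            # A later copy of an already-seen payload could be replaced
            # by a small reference to the first copy.
            duplicates += 1
            bytes_saved += len(payload)
        else:
            seen.add(digest)
    return duplicates, bytes_saved


# Illustrative data only: the first and third payloads are identical.
records = [b"<html>home</html>", b"<html>about</html>", b"<html>home</html>"]
dups, saved = estimate_savings(records)
print(f"{dups} duplicate(s), {saved} byte(s) recoverable")
```

Keeping only the digests in memory, not the payloads, is what makes this kind of pass plausible over a large collection, though at petabyte scale the digest set itself would need to be stored and sorted on disk.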

Details

Creators
Eldakar, Youssef; Nagi, Magdy
Institutions
Date
Keywords
web archiving; deduplication; hash algorithms; ISO 28500; WARC file format; WARCrefs; warcsum
Publication Type
paper
License
CC BY 4.0 International