CONTENT-BASED CHARACTERIZATION OF THE END OF TERM WEB ARCHIVE

Abstract

Since 2008, the End of Term Web Archive has been gathering snapshots of the federal web, which consists of the publicly accessible .gov and .mil websites. In 2022, the End of Term team began to package these crawls into a public dataset, which they released as part of the Amazon Open Data Partnership program. In total, over 460 TB of WARC data was moved from local repositories at the Internet Archive and the University of North Texas Libraries. From the original WARC content, derivative datasets were created that address common use cases for web archives. These derivatives include WAT, WET, CDX, and a format called a WARC Metadata Sidecar. The WARC Metadata Sidecar holds content-based characterizations of files in the archive: character set, language, file format identifier, and soft 404 detection. This paper describes the decisions made in creating these derivatives and the technologies used, and it introduces the WARC Metadata Sidecar, which presents a useful approach for creating and storing auxiliary metadata for web archives.

Details

Creators: Phillips, Mark E.; Phillips, Kristy K.; Alam, Sawood
Keywords: web archives; end of term web archive; warc metadata sidecar
Publication Type: paper
License: CC-BY 4.0 International