CONTENT-BASED CHARACTERIZATION OF THE END OF TERM WEB ARCHIVE

Abstract

Since 2008, the End of Term Web Archive has been gathering snapshots of the federal web, which consists of the publicly accessible .gov and .mil websites. In 2022, the End of Term team began packaging these crawls into a public dataset, which they released as part of the Amazon Open Data Partnership program. In total, over 460 TB of WARC data was moved from local repositories at the Internet Archive and the University of North Texas Libraries. From the original WARC content, derivative datasets were created that address common use cases for web archives. These derivatives include WAT, WET, CDX, and a format called a WARC Metadata Sidecar. The WARC Metadata Sidecar includes content-based characterizations of files held in the archive, including character set, language, file format identifier, and soft 404 detection. This paper describes the decisions made in creating these derivatives and the technologies used, and it introduces the WARC Metadata Sidecar, which offers a useful approach for creating and storing auxiliary metadata for web archives.

Details

Creators
Phillips, Mark E.; Phillips, Kristy K.; Alam, Sawood
Institutions
Date
Keywords
web archives; end of term web archive; warc metadata sidecar
Publication Type
paper
License
CC-BY 4.0 International