Analyzing Large Web Archive Collections Using Parquet Files

Abstract

Web archives continue to grow in cultural heritage organizations around the world. Improvements in the ability to identify, select, capture, and replay these archived web resources are helping many organizations meet their missions of preservation and access in the digital world. With this growth in archived web resources, additional methods of analyzing large amounts of data are needed to understand the complexity of these collections and to help make accurate preservation decisions about the content. The Parquet format is an open source, column-oriented data file format designed for efficient data storage and retrieval. It has implementations in a variety of programming languages and acts as a standardized way of storing large datasets in many big data and distributed storage and analysis applications. It excels at storing tabular data with repeated values that are often queried to identify counts, sums, and unique sets of values. To demonstrate the potential for Parquet files to be used in analyzing web archives, we will use the End of Term (EOT) Web Archive as a data testbed for this tutorial. The EOT project has been gathering snapshots of the federal web, consisting of the publicly accessible ".gov" and ".mil" websites, since 2008. In 2022, the End of Term team packaged these crawls into a public dataset, which they released as part of the Amazon Open Data Sponsorship Program. In total, over 460 TB of WARC data was moved from local repositories at the Internet Archive and the University of North Texas Libraries. The EOT team generated Parquet files for the contents of each term's web crawls to allow for large-scale analysis of content and to answer questions about the data that previously had been challenging or impossible to answer. In this tutorial we will provide participants with hands-on experience analyzing a substantial web archive collection. The tutorial will include an introduction to existing archival collection summarization tools such as CDX Summary and the Archives Unleashed Toolkit, the process of converting CDX(J) files to Parquet files, numerous SQL queries to analyze those Parquet files for practical and common use cases, and visualization of the generated reports.
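To illustrate the kind of workflow the abstract describes, the following is a minimal sketch in Python of converting a CDXJ index into a Parquet file with pyarrow and then querying it with SQL via DuckDB. The field names (url, mime, status, filename), the assumed CDXJ line layout ("SURT timestamp JSON"), and the file names eot-sample.cdxj and eot-sample.parquet are illustrative assumptions, not the tutorial's actual schema or tooling.

```python
"""Sketch: CDXJ -> Parquet conversion and a simple SQL aggregate.

Assumptions (not taken from the tutorial itself):
  - each CDXJ line looks like "<SURT key> <14-digit timestamp> <JSON block>"
  - the JSON block carries keys such as "url", "mime", "status", "filename"
  - pyarrow and duckdb are installed (`pip install pyarrow duckdb`)
"""

import json

import duckdb
import pyarrow as pa
import pyarrow.parquet as pq


def cdxj_to_parquet(cdxj_path: str, parquet_path: str) -> None:
    """Parse a CDXJ file and write selected fields to a Parquet file."""
    rows = []
    with open(cdxj_path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            # Split off the SURT key and timestamp; the rest is JSON metadata.
            surt, timestamp, meta_json = line.rstrip("\n").split(" ", 2)
            meta = json.loads(meta_json)
            rows.append(
                {
                    "surt": surt,
                    "timestamp": timestamp,
                    "url": meta.get("url"),
                    "mime": meta.get("mime"),
                    "status": meta.get("status"),
                    "filename": meta.get("filename"),
                }
            )
    table = pa.Table.from_pylist(rows)
    # Columnar layout plus compression keeps later aggregate scans cheap.
    pq.write_table(table, parquet_path, compression="snappy")


if __name__ == "__main__":
    cdxj_to_parquet("eot-sample.cdxj", "eot-sample.parquet")

    # DuckDB can query the Parquet file directly with plain SQL, e.g.
    # counting captures per MIME type -- the kind of count/sum/unique-set
    # question the abstract mentions.
    result = duckdb.sql(
        """
        SELECT mime, COUNT(*) AS captures
        FROM 'eot-sample.parquet'
        GROUP BY mime
        ORDER BY captures DESC
        LIMIT 10
        """
    )
    print(result)
```

The resulting aggregate table could then be handed to any charting library to produce the kind of visual reports mentioned in the abstract; the same SQL-over-Parquet pattern scales from a single sample file up to a full term's crawl index.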

Details

Creators
Mark Phillips; Sawood Alam
Institutions
University of North Texas Libraries; Internet Archive
Date
2024-09-19 11:00:00 +0100
Keywords
approaches to preservation; from document to data
Publication Type
tutorial
License
Creative Commons Attribution 4.0 (CC-BY-4.0)