GovInfo Digital Repository – Scaling Up to the Standards of Sustainable Trustworthy Digital Repository Certification

Abstract

GovInfo to Scale

The US Government Publishing Office (GPO) is the only organization in the United States to be named a Trustworthy Digital Repository under ISO 16363:2012, Space data and information transfer systems -- Audit and certification of trustworthy digital repositories, and to be certified under CoreTrustSeal, a peer-reviewed form of repository assessment. Currently, GovInfo ingests an average of 6,000 Submission Information Packages (SIPs) of Federal Government information content per week from over 25 Government organizations and authors. GovInfo collections include born-digital publications and government information published as early as 1789. GovInfo holds over 3,393,462 Archival Information Packages (AIPs) of digital Government information objects. These packages contain more than 11.6 million individual PDF files, 30 million images, audio files, spreadsheets, and more. In FY23, over 195 thousand packages were added to GovInfo. The approximate volume of all AIP content is 99 TB. Depending on the complexity of the packages submitted for ingest, processing normally takes less than 5 minutes per package. In FY23, GovInfo served an estimated 1.1 billion document retrievals.

Through What Means?

GovInfo was established in 2009, prior to the commercial availability of many repository services. GPO has evolved and shaped the repository over time through collection-defined metadata rules, file management operations, processors, and processing mechanisms, implemented as a series of micro-services within the overall architectural design of the repository. This poster will feature highlights of this repository design, including newly added tools that allow for multiple types of PDF validation, technical metadata extraction, PII detection, characterization of content, and custom, rule-based parsing of all content.

Recent additions to the repository, such as the Congressionally Mandated Reports collection, highlight how GPO is adding machine-readable formats, responding to legislative requirements, and working with federal agencies to increase access to government information. GPO's team will demonstrate how GPO proactively addresses common challenges, including file corruption, password protection, image quality, metadata errors, enhanced content description, publication linking, digital signatures, and content navigation. Additionally, this poster will feature access points beyond the GovInfo web interface, including RSS, Sitemaps, APIs, and Bulk Data download functionality.

Details

Creators
Alec Bradley; Heidi Ramos; Jessica Tieman
Institutions
Date
2024-09-18 11:00:00 +0100
Keywords
approaches to preservation; scaling up
Publication Type
poster
License
Creative Commons Zero (CC0-1.0)