Designing Scalable Cyberinfrastructure for Metadata Extraction in Billion-Record Archives

Abstract

We present a model and testbed for a curation and preservation infrastructure, "Brown Dog", that applies to heterogeneous and legacy data formats. "Brown Dog" is funded through a National Science Foundation DIBBs (Data Infrastructure Building Blocks) grant and is a partnership between the National Center for Supercomputing Applications at the University of Illinois and the College of Information Studies at the University of Maryland at College Park. In this paper we design and validate a "computational archives" model that uses the Brown Dog data services framework to orchestrate data enrichment activities at petabyte scale on a collection of 100 million archival records. We show how this data services framework can provide customizable workflows through a single point of software integration. We also show how Brown Dog makes it straightforward for organizations to contribute new and legacy data extraction tools that then become part of their own archival workflows and those of the larger community of Brown Dog users. We illustrate one such data extraction tool, a file characterization utility called Siegfried, from its development as an extractor through to its use on archival data.

Details

Creators: Jansen, Gregory; Padhy, Smruti; Marciano, Richard
Institutions
Date
Keywords
Publication Type: paper
License: CC BY-NC-SA 3.0 AT
Direct Download: 1221060 bytes
