Machine Learning For Big Text

Abstract

Big datasets can be a rich source of history, yet they pose many challenges to archivists. They can be difficult to acquire and process because of the varied formats and sheer volume of files, and sensitive content must be identified before materials are made publicly available. These challenges inhibit access for research purposes and often dissuade archivists from acquiring big datasets. Predictive coding can alleviate these challenges by using supervised machine learning to augment appraisal decisions, identify and prioritize sensitive content for review and redaction, and generate descriptive metadata about themes and trends. Following the authors’ previous work processing Capstone email, participants will learn about innovative and effective practices to enable digital preservation of large textual datasets at scale. Hands-on experience with specific tools is provided.

Details

Creators
Joanne Kaczmarek; Brent West
Publication Type
paper
License
CC BY 4.0 International