Diverse Digital Collections Meet Diverse Uses: Applying Natural Language Processing to Born-Digital Primary Sources

Abstract

Use of primary sources often focuses on identifying and tracking entities (e.g. people, places, organizations, events) and other values (e.g. dates and times) across documents. There are many existing open-source natural language processing (NLP) tools that can identify and report on named entities, and projects in the digital humanities have previously demonstrated the scholarly value of NLP approaches when working with digitized materials. To date, there has been relatively little adoption of NLP tools for the analysis of born-digital materials by libraries, archives and museums (LAMs). There are a variety of challenges associated with applying NLP tools to born-digital primary source collections, including those forensically acquired from removable media. Many of the challenges relate to the diversity of materials and potential use cases. This paper reports on the BitCurator NLP project, which is developing software for LAMs to extract and expose features in text extracted from such materials. The resulting services and methods can be used by LAM professionals and the users they serve.

Details

Creators
Christopher Lee; Kam Woods
Keywords
kyoto
Publication Type
paper
License
CC BY-SA 4.0 International
