TUTORIAL: AUTOMATED TOPIC MODELLING IN ARCHIVES PORTAL EUROPE

Abstract


Archives Portal Europe (APE, www.archivesportaleurope.net) is the portal of European archives, an aggregator that brings together at a single point of access the catalogues and digitised archival material of all archives in and about Europe. It currently hosts material from more than 30 countries, in 24 languages, and from a variety of archival institutions (such as state and city archives, university and parish archives, private institutions, etc.). To help users navigate this extremely heterogeneous material, one of the research tools made available by Archives Portal Europe is “topics”: curated collections in which each archival institution participating in the APE project can tag its documents according to a specific topic. Because topics are maintained manually by archivists, and because of the vast amount of archival material ingested into the portal, it is impossible to have a comprehensive body of topics that describes the whole of the APE repository.

In this scenario, automated topic detection can be a fundamental tool to guide archival research and to make archives accessible to potentially worldwide users, in a situation where national and linguistic barriers blur or are redefined.

This workshop presents the creation of an AI tool for automated topic detection in the APE corpus, a vast, inhomogeneous, and multi-lingual collection of historical archival catalogues – the first such project to be designed for archival descriptions rather than corpora of specific documents.

The development is based on supervised machine learning, combining human input in different languages (collectively created taxonomies for each topic) with the use of Wikipedia pages to model the relevant vocabulary and entities. The first iteration of the algorithm was tested on a sample of 9 topics in 5 languages. The second iteration enlarged the sample to 13 topics and 12 languages, for a total of more than 500,000 descriptive units, and introduced Boolean operators and wildcards.
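
To make the role of Boolean operators and wildcards more concrete, the following is a minimal sketch of how taxonomy terms could be matched against descriptive units. It is not the APE tool's implementation, and it covers only rule-based term matching, not the supervised learning or the Wikipedia-based vocabulary modelling. The function names, the rule format (an OR over groups of AND-ed terms, with `*` as a wildcard), and the sample taxonomy are all assumptions invented for illustration.

```python
import re
from typing import Dict, List

# Assumed rule format: an OR over groups; every term inside a group must match (AND).
Rule = List[List[str]]


def term_to_regex(term: str) -> re.Pattern:
    """Compile one taxonomy term; '*' stands for any run of word characters."""
    escaped = re.escape(term).replace(r"\*", r"\w*")
    return re.compile(rf"\b{escaped}\b", re.IGNORECASE)


def rule_matches(rule: Rule, text: str) -> bool:
    """True if at least one AND-group has all of its terms present in the text."""
    return any(
        all(term_to_regex(term).search(text) for term in group)
        for group in rule
    )


def tag_description(text: str, taxonomy: Dict[str, Rule]) -> List[str]:
    """Return the sorted topic labels whose rule matches the description."""
    return sorted(
        topic for topic, rule in taxonomy.items() if rule_matches(rule, text)
    )


# Hypothetical taxonomy, invented for illustration only.
TAXONOMY: Dict[str, Rule] = {
    # migra* OR (passport* AND border*)
    "migration": [["migra*"], ["passport*", "border*"]],
    # ship* AND harbour*
    "maritime": [["ship*", "harbour*"]],
}

print(tag_description("Register of migration permits, 1890-1910", TAXONOMY))
print(tag_description("Passports issued at the border crossing", TAXONOMY))
print(tag_description("Ship logs", TAXONOMY))
```

In this sketch the first two descriptions are tagged with the hypothetical "migration" topic, while the third receives no tag because only one of the two AND-ed "maritime" terms is present. A multilingual setup would presumably keep one such taxonomy per language.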

The workshop will explain how the tool was built, and will allow users to test it live, gathering feedback on its usability and possible future implementations outside of the specific corpus of Archives Portal Europe.

Details

Creators
Arnold, Kerstin
Institutions
Archives Portal Europe Foundation (APEF)
Date
Keywords
automated topic detection
Publication Type
unknown
License
CC-BY 4.0 International