Automatic Identification and Preservation of National Parts of the Internet Outside a Country’s Top Level Domain

Abstract

Our cultural heritage on the Internet is increasingly in danger of being lost because of the challenges faced when collecting it. A growing number of national webpages are moving to generic Top Level Domains such as .com or .org. This migration is so fast that material risks being lost, because it can disappear again before the change has been identified. Identifying such material is therefore becoming increasingly crucial for organizations responsible for digital national heritage, including web archives that cover a specific country. This poster presents the results of a research project that evaluated two automated approaches to recognising webpages outside a country’s Top Level Domain that are part of the country’s cultural heritage. One suggested approach bases the extraction of national material on a snapshot of the entire Internet in the form of a worldwide crawl. The other suggested approach is more silo-oriented and is based on harvests of webpages referred to by webpages within a national Top Level Domain.

Details

Creators: Eld Zierau
Institutions
Date
Keywords: digital preservation; digital curation; chapel hill
Publication Type: poster
License: CC BY 4.0 International
Download: 355512 bytes
