Text File Format Identification: An Application of AI for the Curation of Digital Records

Abstract

File format identification is a necessary step for the effective digital preservation of records. It allows appropriate actions to be taken for the curation of, and access to, each file type. The National Archives, UK, has existing processes for dealing with binary file formats, using tools such as PRONOM and DROID. These methods rely on header information (metadata) and consistent binary sequences. However, they are not appropriate for identifying text file formats, as these do not contain recognisable header information or consistent patterns. Most text formats can be opened as plain text files; however, file type information is often needed to understand a file's use and context. Automated methods are necessary for text file format identification because of the scale of digital records processed by The National Archives. An Artificial Intelligence methodology was tested and implemented using representative data collected from the GitHub repositories of UK Government departments. The first prototype achieved reasonably good performance in detecting five file formats with similar characteristics. The results encourage us to carry out additional experiments covering further text file format types.
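The contrast drawn in the abstract can be illustrated with a minimal Python sketch. It is not the method used by DROID, PRONOM, or the prototype described here: the signature table is a small hand-written sample, and the function names and the choice of character-rate features are assumptions made purely for illustration. The first function shows why leading byte signatures work for binary formats; the second shows the kind of content-derived features a supervised classifier might be trained on when no signature exists.

```python
from typing import Dict, Optional

# Hypothetical, hand-written sample of well-known leading byte signatures.
# This is NOT the PRONOM registry, only an illustration of the idea.
SIGNATURES = {
    b"%PDF-": "PDF",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"PK\x03\x04": "ZIP container",
}


def identify_by_signature(data: bytes) -> Optional[str]:
    """Return a format name if the data starts with a known signature."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None  # typical outcome for plain text files


def text_features(text: str) -> Dict[str, float]:
    """Per-line character rates (assumed features, not taken from the paper)."""
    lines = text.splitlines() or [""]
    n = len(lines)
    return {
        "comma_rate": sum(line.count(",") for line in lines) / n,
        "tab_rate": sum(line.count("\t") for line in lines) / n,
        "brace_rate": sum(line.count("{") + line.count("}") for line in lines) / n,
        "colon_rate": sum(line.count(":") for line in lines) / n,
    }
```

A comma-separated file passed to `identify_by_signature` returns `None`, whereas its `comma_rate` from `text_features` is distinctive. Feature vectors of this kind, labelled with known formats, are the sort of input a supervised learner would need.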

Details

Creators
Santhilata Kuppili Venkata; Paul Young; Alex Green
Institutions
The National Archives, UK
Date
Keywords
text file formats; supervised learning; digital preservation
Publication Type
paper
License
CC BY 4.0 International
