Challenges in Accessing Information in Digitized 19th-Century Czech Texts

Abstract

This short paper describes problems that arise in optical character recognition (OCR) of, and information retrieval from, historical texts in languages with rich morphology, rather discontinuous lexical development, and a long history of spelling reforms. As work in progress, it illustrates these problems and the proposed linguistic solutions using a current project that aims to improve access to digitized Czech prints from the 19th century and the first half of the 20th century.

Details

Creators
Karel Kucera; Martin Stluka
Institutions
Date
Keywords
ischool; toronto; canada; information retrieval; known-item retrieval; historical text; lemma; hyperlemma
Publication Type
paper
License
CC BY-NC-SA 3.0 AT
