Processing M.A. Castrén's Materials: Multilingual Typed and Handwritten Manuscripts
Niko Partanen, Jack Rueter, Mika Hämäläinen, Khalid Alnajjar

TL;DR
This paper reports on the processing and digitization of Matthias Castrén's manuscripts, enhancing their usability for computational tasks and providing benchmarks for text recognition, thereby supporting further research and digital humanities applications.
Contribution
It introduces workflows and technical infrastructure for processing Castrén's manuscripts, creating datasets for computational analysis, and establishing benchmarks for text recognition tasks.
Findings
Datasets are openly available on Zenodo.
Workflows improve usability for technical applications.
Benchmarks for text recognition are provided.
Abstract
The study forms a technical report of various tasks that have been performed on the materials collected and published by Finnish ethnographer and linguist, Matthias Alexander Castrén (1813-1852). The Finno-Ugrian Society is publishing Castrén's manuscripts as new critical and digital editions, and at the same time different research groups have also paid attention to these materials. We discuss the workflows and technical infrastructure used, and consider how datasets that benefit different computational tasks could be created to further improve the usability of these materials, and also to aid the further processing of similar archived collections. We specifically focus on the parts of the collections that are processed in a way that improves their usability in more technical applications, complementing the earlier work on the cultural and linguistic aspects of these materials.…
Taxonomy
Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Mathematics, Computing, and Information Processing
