TL;DR
This paper introduces a benchmark dataset of transcribed scanned books in three critically endangered languages, together with an OCR post-correction method designed for data-scarce training that reduces the recognition error rate by 34% on average.
Contribution
It provides a benchmark dataset of transcriptions for scanned books in three critically endangered languages, analyzes how general-purpose OCR tools perform on them, and proposes a post-correction method tailored to settings with little training data.
Findings
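General-purpose OCR tools are not robust to the data-scarce setting of endangered languages.
The proposed post-correction method reduces the recognition error rate by 34% on average across the three languages.
The benchmark dataset enables future research on text extraction for endangered languages.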
Abstract
There is little to no data available to build natural language processing models for most endangered languages. However, textual data in these languages often exists in formats that are not machine-readable, such as paper books and scanned images. In this work, we address the task of extracting text from these resources. We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages and present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting of endangered languages. We develop an OCR post-correction method tailored to ease training in this data-scarce setting, reducing the recognition error rate by 34% on average across the three languages.
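To make the "34% reduction" figure concrete, the following is a minimal sketch of how a relative error-rate reduction can be computed. It assumes the error rate is a character error rate (edit distance divided by reference length); the abstract does not say which metric is used, and the function names and example numbers here are hypothetical illustrations, not the authors' evaluation code or results.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between a reference and a hypothesis string."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """Character error rate (assumed metric): edits per reference character."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which the error rate dropped relative to the baseline."""
    if before == 0:
        raise ValueError("baseline error rate must be non-zero")
    return (before - after) / before


# Hypothetical numbers for illustration only (not taken from the paper):
# a baseline error rate of 0.50 falling to 0.33 is a 34% relative reduction.
reduction = relative_reduction(0.50, 0.33)
```

Note that a relative reduction is measured against the baseline OCR error rate, so a 34% reduction does not mean the error rate falls by 34 percentage points.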
