OCR Post Correction for Endangered Language Texts

Shruti Rijhwani; Antonios Anastasopoulos; Graham Neubig

arXiv:2011.05402·cs.CL·November 12, 2020

TL;DR

This paper introduces a benchmark dataset of transcribed scanned books in three critically endangered languages and an OCR post-correction method designed for data-scarce training, which reduces the recognition error rate by 34% on average across the three languages.

Contribution

It provides a benchmark dataset of transcriptions for scanned books in three critically endangered languages, a systematic analysis of how general-purpose OCR tools fall short on them, and a post-correction method tailored to the data-scarce setting.

Findings

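01 General-purpose OCR tools are not robust to the data-scarce setting of endangered languages.

02 The proposed post-correction method reduces the recognition error rate by 34% on average across the three languages.

03 The benchmark dataset of transcribed scanned books supports future research on OCR for endangered languages.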

Abstract

There is little to no data available to build natural language processing models for most endangered languages. However, textual data in these languages often exists in formats that are not machine-readable, such as paper books and scanned images. In this work, we address the task of extracting text from these resources. We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages and present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting of endangered languages. We develop an OCR post-correction method tailored to ease training in this data-scarce setting, reducing the recognition error rate by 34% on average across the three languages.
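To make the "34% reduction in recognition error rate" concrete, the sketch below shows one common way such a number can be computed: a character error rate (CER) based on edit distance, followed by the relative reduction between the raw OCR output and the post-corrected output. The abstract does not say which error metric the authors report, so the choice of CER here is an assumption, and the function names and example strings are hypothetical, not taken from the paper or its data.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by reference length (assumed metric, see lead-in)."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error removed, e.g. 0.34 for a 34% reduction."""
    if before <= 0:
        raise ValueError("baseline error rate must be positive")
    return (before - after) / before


# Hypothetical strings for illustration only; not drawn from the benchmark.
reference = "kitten sitting"
ocr_output = "kittcn sittinq"      # two character errors
post_corrected = "kitten sittinq"  # one error remains after correction

cer_before = char_error_rate(reference, ocr_output)
cer_after = char_error_rate(reference, post_corrected)
print(f"CER before: {cer_before:.3f}, after: {cer_after:.3f}")
print(f"Relative reduction: {relative_reduction(cer_before, cer_after):.0%}")
```

A relative reduction is measured against the first-pass OCR error, so a 34% reduction means roughly one third of the original errors are removed, not that the absolute error rate drops by 34 percentage points.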
