OCR Error Post-Correction with LLMs in Historical Documents: No Free Lunches

Jenna Kanerva; Cassandra Ledins; Siiri Käpyaho; Filip Ginter

arXiv:2502.01205·cs.CL·February 4, 2025·2 cites

TL;DR

This paper evaluates the effectiveness of open-weight large language models in correcting OCR errors in historical English and Finnish texts, highlighting their potential and current limitations.

Contribution

It systematically assesses various strategies for OCR post-correction using LLMs and provides insights into their performance on historical datasets.

Findings

01. LLMs reduce character error rates (CER) on historical English OCR text

02. Performance on Finnish OCR text did not reach a practically useful level

03. Strategies such as parameter tuning and segmentation affect correction quality

Abstract

Optical Character Recognition (OCR) systems often introduce errors when transcribing historical documents, leaving room for post-correction to improve text quality. This study evaluates the use of open-weight LLMs for OCR error correction in historical English and Finnish datasets. We explore various strategies, including parameter optimization, quantization, segment length effects, and text continuation methods. Our results demonstrate that while modern LLMs show promise in reducing character error rates (CER) in English, a practically useful performance for Finnish was not reached. Our findings highlight the potential and limitations of LLMs in scaling OCR post-correction for large historical corpora.
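
The abstract reports results in terms of character error rate (CER). As a minimal sketch, assuming the common definition of CER as character-level edit distance divided by the length of the reference text, the metric and the relative improvement from post-correction could be computed as follows. The function names and the sample strings are illustrative and do not come from the paper, whose exact normalization may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """CER under the assumed definition: edit distance / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def relative_cer_reduction(reference: str, ocr_output: str, corrected: str) -> float:
    """Fraction of the original CER removed by post-correction.

    Negative values mean the correction step made the text worse.
    """
    before = cer(reference, ocr_output)
    after = cer(reference, corrected)
    if before == 0:
        return 0.0
    return (before - after) / before


if __name__ == "__main__":
    # Hypothetical example strings, not taken from the paper's datasets.
    reference = "the quick fox"
    ocr_output = "tbe qnick fox"
    corrected = "the qnick fox"
    print(cer(reference, ocr_output))
    print(cer(reference, corrected))
    print(relative_cer_reduction(reference, ocr_output, corrected))
```

A relative measure of this kind makes it visible when a model introduces more errors than it fixes, which matters for the Finnish case where a practically useful performance was not reached.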

Taxonomy

Topics: Natural Language Processing Techniques