OCR Error Post-Correction with LLMs in Historical Documents: No Free Lunches
Jenna Kanerva, Cassandra Ledins, Siiri Käpyaho, Filip Ginter

TL;DR
This paper evaluates the effectiveness of open-weight large language models in correcting OCR errors in historical English and Finnish texts, highlighting their potential and current limitations.
Contribution
It systematically assesses various strategies for OCR post-correction using LLMs and provides insights into their performance on historical datasets.
Findings
LLMs reduce character error rates in English OCR tasks
Performance on Finnish OCR remains below practical usefulness
Strategies like parameter tuning and segmentation impact correction quality
Abstract
Optical Character Recognition (OCR) systems often introduce errors when transcribing historical documents, leaving room for post-correction to improve text quality. This study evaluates the use of open-weight LLMs for OCR error correction in historical English and Finnish datasets. We explore various strategies, including parameter optimization, quantization, segment length effects, and text continuation methods. Our results demonstrate that while modern LLMs show promise in reducing character error rates (CER) in English, a practically useful performance for Finnish was not reached. Our findings highlight the potential and limitations of LLMs in scaling OCR post-correction for large historical corpora.
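As a rough illustration of the metric the abstract refers to, the sketch below computes character error rate (CER) in its common form: the Levenshtein edit distance between the reference and the OCR output, divided by the reference length. The function names (`levenshtein`, `cer`, `cer_reduction`) and the sample strings are hypothetical, and the paper's own evaluation may normalize or preprocess text differently.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def cer_reduction(reference: str, ocr: str, corrected: str) -> float:
    """Relative CER improvement of post-corrected text over raw OCR.

    Positive values mean the correction helped; negative values mean it
    introduced more errors than it fixed.
    """
    before = cer(reference, ocr)
    after = cer(reference, corrected)
    if before == 0:
        return 0.0  # assumption: report no change when the OCR was already perfect
    return (before - after) / before


if __name__ == "__main__":
    # Invented sample strings, for illustration only.
    reference = "historical"
    ocr_output = "hist0rica1"
    corrected = "historica1"
    print(cer(reference, ocr_output))
    print(cer(reference, corrected))
    print(cer_reduction(reference, ocr_output, corrected))
```

A relative reduction is used here because it makes corrections comparable across documents with different baseline OCR quality. A negative value flags cases where post-correction made the text worse.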
Taxonomy
Topics: Natural Language Processing Techniques
