Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction
Laura Manrique-Gómez, Tony Montes, Arturo Rodríguez-Herrera, and Rubén Manrique

TL;DR
This paper introduces a new corpus of 19th-century Latin American newspapers and an LLM-based OCR correction framework, supporting historical linguistic research and more accurate digitization.
Contribution
It provides a novel dataset of Latin American newspapers and a flexible LLM-based OCR correction framework for historical texts.
Findings
Created a comprehensive 19th-century Latin American newspaper corpus
Developed a semi-automated LLM-based OCR correction method
Improved OCR accuracy and linguistic analysis capabilities
Abstract
This paper presents two significant contributions: First, it introduces a novel dataset of 19th-century Latin American newspaper texts, addressing a critical gap in specialized corpora for historical and linguistic analysis in this region. Second, it develops a flexible framework that utilizes a Large Language Model for OCR error correction and linguistic surface form detection in digitized corpora. This semi-automated framework is adaptable to various contexts and datasets and is applied to the newly created dataset.
Taxonomy
Topics: Natural Language Processing Techniques · Digital Humanities and Scholarship
