Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction

Laura Manrique-Gómez, Tony Montes, Arturo Rodríguez-Herrera, and Rubén Manrique

arXiv:2407.12838 · cs.CL · March 31, 2025 · 1 cites

TL;DR

This paper introduces a new corpus of 19th-century Latin American newspapers and an LLM-based OCR correction framework, intended to support historical linguistic research and improve digitization accuracy.

Contribution

It provides a novel dataset of Latin American newspapers and a flexible LLM-based OCR correction framework for historical texts.

Findings

01 Created a comprehensive 19th-century Latin American newspaper corpus

02 Developed a semi-automated LLM-based OCR correction method

03 Improved OCR accuracy and linguistic analysis capabilities

Abstract

This paper presents two significant contributions: First, it introduces a novel dataset of 19th-century Latin American newspaper texts, addressing a critical gap in specialized corpora for historical and linguistic analysis in this region. Second, it develops a flexible framework that utilizes a Large Language Model for OCR error correction and linguistic surface form detection in digitized corpora. This semi-automated framework is adaptable to various contexts and datasets and is applied to the newly created dataset.

Code & Models

Repositories

historicalink/LatamXIX (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Digital Humanities and Scholarship
