Reading the unreadable: Creating a dataset of 19th century English newspapers using image-to-text language models

Jonathan Bourne

arXiv:2502.14901 · cs.CL · February 24, 2025

TL;DR

This paper presents a high-quality OCR dataset of 19th-century English newspapers created using a state-of-the-art image-to-text model, significantly improving text accessibility for historical research.

Contribution

It introduces NCSE v2.0, a large, accurately OCRed dataset of 19th-century newspapers and periodicals produced with Pixtral 12B, which outperformed the four other OCR approaches it was compared against, enabling new research opportunities.

Findings

01

Pixtral 12B achieved a median character error rate of 1%, 5x lower than the next best model.

02

The dataset contains 1.4 million entries and 321 million words.

03

Enhanced article identification and topic classification are demonstrated.

Abstract

Oscar Wilde said, "The difference between literature and journalism is that journalism is unreadable, and literature is not read." Unfortunately, the digitally archived journalism of Oscar Wilde's 19th century often has no or poor-quality Optical Character Recognition (OCR), reducing the accessibility of these archives and making them unreadable both figuratively and literally. This paper helps address the issue by performing OCR on "The Nineteenth Century Serials Edition" (NCSE), an 84k-page collection of 19th-century English newspapers and periodicals, using Pixtral 12B, a pre-trained image-to-text language model. The OCR capability of Pixtral was compared to 4 other OCR approaches, achieving a median character error rate of 1%, 5x lower than the next best model. The resulting NCSE v2.0 dataset features improved article identification, high-quality OCR, and text classified into four…
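The headline metric in the abstract, median character error rate (CER), can be illustrated with a short sketch. It uses the common definition of CER as the Levenshtein edit distance between the OCR output and a ground-truth transcription, divided by the length of the ground truth. This is an assumption: the summary here does not state the paper's exact evaluation procedure (for example, any text normalisation), and the function names `levenshtein`, `cer`, and `median_cer` are hypothetical, not taken from the paper's repository.

```python
from statistics import median


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion, and substitution."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def median_cer(pairs) -> float:
    """Median CER over (reference, hypothesis) pairs, e.g. one pair per page or article."""
    return median(cer(ref, hyp) for ref, hyp in pairs)
```

Reporting the median rather than the mean makes the summary less sensitive to a handful of badly degraded pages, which is one plausible reason to quote it for historical scans.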

Code & Models

Repositories

JonnoB/reading_the_unreadable
