TL;DR
This paper presents a high-quality OCR dataset of 19th-century English newspapers created using a state-of-the-art image-to-text model, significantly improving text accessibility for historical research.
Contribution
It introduces NCSE v2.0, a large dataset of 19th-century newspapers and periodicals OCRed with Pixtral 12B, which outperformed the other OCR methods tested and opens up new research opportunities.
Findings
Pixtral 12B achieved a median character error rate of 1%, 5x lower than the next best of the 4 other OCR approaches tested.
The dataset contains 1.4 million entries and 321 million words.
Enhanced article identification and topic classification are demonstrated.
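The headline metric above, character error rate (CER), is conventionally defined as the edit distance between the OCR output and a ground-truth transcription, divided by the length of the ground truth. The sketch below assumes that standard definition; this summary does not describe the paper's exact evaluation procedure (text normalisation, page sampling), so the function names and sample strings here are hypothetical and for illustration only.

```python
from statistics import median


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of an OCR hypothesis against a ground-truth reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def median_cer(pairs: list[tuple[str, str]]) -> float:
    """Median CER over (reference, hypothesis) pairs, e.g. one pair per page."""
    return median(cer(ref, hyp) for ref, hyp in pairs)


# Hypothetical sample pairs, not data from the paper.
sample_pairs = [
    ("abcd", "abcd"),  # perfect transcription
    ("abcd", "abxd"),  # one substitution
    ("abcd", "ab"),    # two deletions
]
sample_median = median_cer(sample_pairs)
```

Reporting the median rather than the mean keeps a handful of badly degraded pages from dominating the score, which is one plausible reason to summarise OCR quality this way on historical scans.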
Abstract
Oscar Wilde said, "The difference between literature and journalism is that journalism is unreadable, and literature is not read." Unfortunately, the digitally archived journalism of Oscar Wilde's 19th century often has poor-quality Optical Character Recognition (OCR) or none at all, reducing the accessibility of these archives and making them unreadable both figuratively and literally. This paper helps address the issue by performing OCR on "The Nineteenth Century Serials Edition" (NCSE), an 84k-page collection of 19th-century English newspapers and periodicals, using Pixtral 12B, a pre-trained image-to-text language model. Pixtral was compared with 4 other OCR approaches and achieved a median character error rate of 1%, 5x lower than the next best model. The resulting NCSE v2.0 dataset features improved article identification, high-quality OCR, and text classified into four…
