Large Synthetic Data from the arXiv for OCR Post Correction of Historic Scientific Articles

Jill P. Naiman; Morgan G. Cosillo; Peter K. G. Williams; Alyssa Goodman

arXiv:2309.11549 · cs.DL · September 22, 2023

Open Access · 1 Repo

TL;DR

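This paper presents a large synthetic ground truth/OCR dataset mined from the arXiv for OCR post-correction of historic scientific articles. Baseline models trained on it improve character and word error rates on historical OCR text by a mean of 7.71% and 18.82%, respectively, and reach an F1 score of 77.82% when classifying parts of sentences as inline math.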

Contribution

By mining the arXiv, the authors created what is, to their knowledge, the largest synthetic ground truth/OCR post-correction dataset for scientific articles (203,354,393 character pairs), supporting OCR correction models and tools for exploring the dataset.

Findings

01. Mean character error rate improvement of 7.71%

02. Mean word error rate improvement of 18.82%

03. F1 score of 77.82% for inline math classification
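
For readers unfamiliar with the metric in the third finding, the following is a minimal sketch of how an F1 score is computed for a binary "inline math / not inline math" labeling. The function name and the example labels are hypothetical, and this page does not say at what granularity (character, token, or span) the authors score the classification, so the per-item labels here are an assumption.

```python
def f1_score(true_labels, predicted_labels):
    """F1 for a binary task; 1 marks 'inline math', 0 marks everything else."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # math found correctly
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # text flagged as math
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # math that was missed
    if tp == 0:
        return 0.0  # no correct positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for five items (not data from the paper).
truth = [1, 0, 1, 1, 0]
guess = [1, 1, 1, 0, 0]
score = f1_score(truth, guess)
```

In this toy case there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and so is the F1 score.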

Abstract

Scientific articles published prior to the "age of digitization" (~1997) require Optical Character Recognition (OCR) to transform scanned documents into machine-readable text, a process that often produces errors. We develop a pipeline for the generation of a synthetic ground truth/OCR dataset to correct the OCR results of the astrophysics literature holdings of the NASA Astrophysics Data System (ADS). By mining the arXiv we create, to the authors' knowledge, the largest scientific synthetic ground truth/OCR post correction dataset of 203,354,393 character pairs. We provide baseline models trained with this dataset and find the mean improvement in character and word error rates of 7.71% and 18.82% for historical OCR text, respectively. When used to classify parts of sentences as inline math, we find a classification F1 score of 77.82%. Interactive dashboards to explore the dataset are…
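
The character and word error rates mentioned in the abstract are conventionally computed as an edit distance between the ground truth and the OCR output, divided by the length of the ground truth. The sketch below assumes that standard Levenshtein-based definition; the function names and the example strings are hypothetical, and this page does not state exactly how the authors compute the rates or the reported improvements.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i items of ref
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ground_truth, ocr_text):
    """Character error rate: character edits per ground-truth character."""
    return edit_distance(ground_truth, ocr_text) / len(ground_truth)


def wer(ground_truth, ocr_text):
    """Word error rate: word edits per ground-truth word (whitespace split)."""
    ref_words = ground_truth.split()
    return edit_distance(ref_words, ocr_text.split()) / len(ref_words)


# Hypothetical example: OCR misreads "m" as "rn", a common confusion.
truth = "the mass of the star"
ocr = "the rnass of the star"
char_rate = cer(truth, ocr)
word_rate = wer(truth, ocr)
```

Here the single misread costs two character edits out of 20 characters but one word out of five, which illustrates why word error rates tend to be larger than character error rates for the same text.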

Code & Models

Repositories

readingtimemachine/ocr_post_correction
PyTorch · Official

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Handwritten Text Recognition Techniques · Natural Language Processing Techniques