A Survey of OCR Evaluation Methods and Metrics and the Invisibility of Historical Documents
Fitsum Sileshi Beyene, Christopher L. Dancy

TL;DR
This paper reviews OCR evaluation methods, highlights the neglect of historical and marginalized documents, especially Black newspapers, and discusses how current benchmarks overlook structural failures, contributing to structural invisibility and representational harm.
Contribution
It provides a comprehensive review of OCR evaluation practices, identifies gaps in assessing historical documents, and discusses organizational factors influencing these shortcomings.
Findings
Black newspapers are rarely included in training data or benchmarks.
Current evaluations focus on modern layouts, missing structural failures in historical documents.
Evaluation gaps contribute to structural invisibility and representational harm.
Abstract
Optical character recognition (OCR) and document understanding systems increasingly rely on large vision and vision-language models, yet evaluation remains centered on modern, Western, and institutional documents. This emphasis masks system behavior in historical and marginalized archives, where layout, typography, and material degradation shape interpretation. This study examines how OCR and document understanding systems are evaluated, with particular attention to Black historical newspapers. Following the PRISMA framework, we review OCR and document understanding papers and benchmark datasets published between 2006 and 2025. We examine how the studies report training data, benchmark design, and evaluation metrics for vision transformer and multimodal OCR systems. We found that Black newspapers and other community-produced historical documents rarely…
