OCR-Quality: A Human-Annotated Dataset for OCR Quality Assessment

Yulong Zhang

arXiv:2510.21774 · cs.CV · October 28, 2025

TL;DR

OCR-Quality is a new human-annotated dataset of 1,000 diverse PDF pages with quality scores, aimed at improving OCR quality assessment methods and serving as a benchmark for evaluation.

Contribution

The paper introduces OCR-Quality, a comprehensive dataset with manual annotations for OCR quality, facilitating development and benchmarking of assessment techniques.

Findings

01

The dataset samples diverse real-world documents: academic papers, textbooks, e-books, and multilingual documents.

02

Each page is manually annotated on a 4-level quality scale (1: Excellent, 2: Good, 3: Fair, 4: Poor).

03

Publicly available for research and evaluation.

Abstract

We present OCR-Quality, a comprehensive human-annotated dataset designed for evaluating and developing OCR quality assessment methods. The dataset consists of 1,000 PDF pages converted to PNG images at 300 DPI, sampled from diverse real-world scenarios, including academic papers, textbooks, e-books, and multilingual documents. Each document has been processed using state-of-the-art Vision-Language Models (VLMs) and manually annotated with quality scores using a 4-level scoring system (1: Excellent, 2: Good, 3: Fair, 4: Poor). The dataset includes detailed source information, annotation guidelines, and representative cases across various difficulty levels. OCR-Quality addresses the critical need for reliable OCR quality assessment in real-world applications and provides a valuable benchmark for training and evaluating OCR verification systems. The dataset is publicly available at…
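As a rough illustration of how the 4-level scale might be used to benchmark an automated quality assessor, the sketch below compares predicted scores against human scores. Only the score labels come from the abstract; the function names, the choice of metrics (exact-match rate and mean absolute error), and the sample scores are assumptions made for illustration, not the paper's evaluation protocol or the dataset's actual file format.

```python
from collections import Counter

# Score labels as stated in the abstract (1 is best, 4 is worst).
QUALITY_LABELS = {1: "Excellent", 2: "Good", 3: "Fair", 4: "Poor"}


def validate_scores(scores):
    """Raise ValueError if any score falls outside the 4-level scale."""
    invalid = [s for s in scores if s not in QUALITY_LABELS]
    if invalid:
        raise ValueError(f"scores outside the 1-4 scale: {invalid}")


def compare_scores(human, predicted):
    """Return (exact-match rate, mean absolute error) for two score lists.

    Hypothetical helper: these metrics are an assumption, not the
    paper's stated evaluation method.
    """
    if not human or len(human) != len(predicted):
        raise ValueError("score lists must be non-empty and of equal length")
    validate_scores(human)
    validate_scores(predicted)
    n = len(human)
    exact = sum(h == p for h, p in zip(human, predicted)) / n
    mae = sum(abs(h - p) for h, p in zip(human, predicted)) / n
    return exact, mae


def label_distribution(scores):
    """Count how many pages fall into each named quality level."""
    validate_scores(scores)
    counts = Counter(scores)
    return {QUALITY_LABELS[level]: counts.get(level, 0) for level in QUALITY_LABELS}


# Made-up scores for four pages, purely for demonstration.
human_scores = [1, 2, 2, 4]
predicted_scores = [1, 3, 2, 2]

exact_rate, mean_abs_err = compare_scores(human_scores, predicted_scores)
distribution = label_distribution(human_scores)
```

Because the scale is ordinal, mean absolute error penalizes a prediction of 4 for a page rated 1 more than a prediction of 2, which exact-match rate alone would not capture.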
