OCR-Quality: A Human-Annotated Dataset for OCR Quality Assessment
Yulong Zhang

TL;DR
OCR-Quality is a new human-annotated dataset of 1,000 diverse PDF pages with quality scores, aimed at improving OCR quality assessment methods and serving as a benchmark for evaluation.
Contribution
The paper introduces OCR-Quality, a comprehensive dataset with manual annotations for OCR quality, facilitating development and benchmarking of assessment techniques.
Findings
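The dataset covers diverse real-world documents, including academic papers, textbooks, e-books, and multilingual documents.
Each page is manually annotated on a four-level quality scale (1: Excellent, 2: Good, 3: Fair, 4: Poor).
The dataset is publicly available for research and evaluation.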
Abstract
We present OCR-Quality, a comprehensive human-annotated dataset designed for evaluating and developing OCR quality assessment methods. The dataset consists of 1,000 PDF pages converted to PNG images at 300 DPI, sampled from diverse real-world scenarios, including academic papers, textbooks, e-books, and multilingual documents. Each document has been processed using state-of-the-art Vision-Language Models (VLMs) and manually annotated with quality scores using a 4-level scoring system (1: Excellent, 2: Good, 3: Fair, 4: Poor). The dataset includes detailed source information, annotation guidelines, and representative cases across various difficulty levels. OCR-Quality addresses the critical need for reliable OCR quality assessment in real-world applications and provides a valuable benchmark for training and evaluating OCR verification systems. The dataset is publicly available at…
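To make the 4-level scoring system concrete, here is a minimal Python sketch of how an OCR quality predictor might be compared against the human scores. Only the four level names and their numeric values come from the abstract. The `Page` record, its `page_id` field, the `evaluate` function, and the choice of metrics (exact-match accuracy and mean absolute error) are assumptions made for illustration; they do not describe the dataset's actual file format or the authors' evaluation protocol.

```python
from dataclasses import dataclass
from enum import IntEnum


class Quality(IntEnum):
    """The four levels named in the abstract (lower is better)."""
    EXCELLENT = 1
    GOOD = 2
    FAIR = 3
    POOR = 4


@dataclass
class Page:
    # Hypothetical record layout; the real dataset schema may differ.
    page_id: str
    human_score: Quality


def evaluate(pages, predicted):
    """Compare predicted scores with human scores.

    `predicted` maps page_id -> integer score in 1..4.
    Returns (exact-match accuracy, mean absolute error).
    """
    if not pages:
        raise ValueError("no pages to evaluate")
    exact = 0
    abs_err = 0
    for page in pages:
        # Quality(...) raises ValueError for scores outside 1..4.
        pred = Quality(predicted[page.page_id])
        exact += pred == page.human_score
        abs_err += abs(int(pred) - int(page.human_score))
    n = len(pages)
    return exact / n, abs_err / n
```

Mean absolute error is included alongside accuracy because the levels are ordered: predicting Good for a Poor page is a larger miss than predicting Fair, and accuracy alone would not show that difference.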
