PubMed-OCR: PMC Open Access OCR Annotations
Hunter Heidenreich, Yosheb Getachew, Olivia Dinica, Ben Elliott

TL;DR
PubMed-OCR is a large OCR-annotated corpus of scientific articles from PubMed Central Open Access PDFs, with word-, line-, and paragraph-level bounding boxes, designed to support layout-aware modeling and evaluation.
Contribution
The paper introduces an OCR-centric dataset of scientific articles, annotated with Google Cloud Vision and released in a compact JSON schema, to support layout-aware OCR research and evaluation.
Findings
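The corpus spans 209.5K articles (1.5M pages; ~1.3B words).
Annotations support layout-aware modeling, coordinate-grounded QA, and evaluation of OCR-dependent pipelines.
The analysis covers journal coverage and detected layout features, and notes two limitations: reliance on a single OCR engine and heuristic line reconstruction.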
Abstract
PubMed-OCR is an OCR-centric corpus of scientific articles derived from PubMed Central Open Access PDFs. Each page image is annotated with Google Cloud Vision and released in a compact JSON schema with word-, line-, and paragraph-level bounding boxes. The corpus spans 209.5K articles (1.5M pages; ~1.3B words) and supports layout-aware modeling, coordinate-grounded QA, and evaluation of OCR-dependent pipelines. We analyze corpus characteristics (e.g., journal coverage and detected layout features) and discuss limitations, including reliance on a single OCR engine and heuristic line reconstruction. We release the data and schema to facilitate downstream research and invite extensions.
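To make the idea of word-level boxes and heuristic line reconstruction concrete, here is a minimal sketch in Python. The field names (`page`, `words`, `text`, `bbox`), the `[x0, y0, x1, y1]` box format, and the overlap rule are all assumptions chosen for illustration; they are not the released schema or the authors' actual heuristic.

```python
import json

# Hypothetical page record. Field names and the [x0, y0, x1, y1] box
# format are assumptions for illustration, not the released schema.
PAGE_JSON = """
{
  "page": 1,
  "words": [
    {"text": "PubMed-OCR", "bbox": [10, 10, 90, 22]},
    {"text": "corpus", "bbox": [95, 11, 140, 23]},
    {"text": "Each", "bbox": [10, 30, 40, 42]},
    {"text": "page", "bbox": [45, 30, 80, 42]}
  ]
}
"""


def group_words_into_lines(words, min_overlap=0.5):
    """Greedy heuristic: a word joins an existing line when its vertical
    extent overlaps the line's by at least `min_overlap` of the smaller
    of the two heights; otherwise it starts a new line."""
    lines = []
    # Visit words top-to-bottom, then left-to-right.
    for word in sorted(words, key=lambda w: (w["bbox"][1], w["bbox"][0])):
        x0, y0, x1, y1 = word["bbox"]
        placed = False
        for line in lines:
            lx0, ly0, lx1, ly1 = line["bbox"]
            overlap = min(y1, ly1) - max(y0, ly0)
            smaller = min(y1 - y0, ly1 - ly0)
            if smaller > 0 and overlap / smaller >= min_overlap:
                line["words"].append(word)
                # Grow the line box to enclose the new word.
                line["bbox"] = [min(lx0, x0), min(ly0, y0),
                                max(lx1, x1), max(ly1, y1)]
                placed = True
                break
        if not placed:
            lines.append({"words": [word], "bbox": [x0, y0, x1, y1]})
    # Order words within each line by their left edge and join the text.
    for line in lines:
        line["words"].sort(key=lambda w: w["bbox"][0])
        line["text"] = " ".join(w["text"] for w in line["words"])
    return lines


page = json.loads(PAGE_JSON)
lines = group_words_into_lines(page["words"])
for line in lines:
    print(line["bbox"], line["text"])
```

A rule like this is sensitive to skewed scans, multi-column layouts, and superscripts, which illustrates why heuristic line reconstruction is listed among the corpus limitations.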
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Handwritten Text Recognition Techniques · Cell Image Analysis Techniques
