PubMed-OCR: PMC Open Access OCR Annotations

Hunter Heidenreich; Yosheb Getachew; Olivia Dinica; Ben Elliott

arXiv:2601.11425·cs.CV·January 19, 2026

TL;DR

PubMed-OCR is an OCR-annotated corpus of 209.5K scientific articles derived from PubMed Central Open Access PDFs. It provides word-, line-, and paragraph-level bounding boxes to support layout-aware modeling and the evaluation of OCR-dependent pipelines.

Contribution

The paper introduces an OCR-centric dataset of scientific articles, annotated with Google Cloud Vision and released in a compact JSON schema with word-, line-, and paragraph-level bounding boxes, to support layout-aware OCR research and evaluation.

Findings

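01. The corpus spans 209.5K articles (1.5M pages; ~1.3B words).

02. The annotations support layout-aware modeling, coordinate-grounded QA, and evaluation of OCR-dependent pipelines.

03. The analysis covers corpus characteristics such as journal coverage and detected layout features, and discusses limitations, including reliance on a single OCR engine and heuristic line reconstruction.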

Abstract

PubMed-OCR is an OCR-centric corpus of scientific articles derived from PubMed Central Open Access PDFs. Each page image is annotated with Google Cloud Vision and released in a compact JSON schema with word-, line-, and paragraph-level bounding boxes. The corpus spans 209.5K articles (1.5M pages; ~1.3B words) and supports layout-aware modeling, coordinate-grounded QA, and evaluation of OCR-dependent pipelines. We analyze corpus characteristics (e.g., journal coverage and detected layout features) and discuss limitations, including reliance on a single OCR engine and heuristic line reconstruction. We release the data and schema to facilitate downstream research and invite extensions.
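To make the idea of heuristic line reconstruction more concrete, the following is a minimal sketch of how word-level boxes might be grouped into lines by vertical overlap. It is not the authors' method and does not reflect the released JSON schema: the `Word` fields (`x0`, `y0`, `x1`, `y1`), the function names, and the `min_overlap` threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    # Hypothetical word record; the real schema's field names may differ.
    text: str
    x0: float  # left
    y0: float  # top
    x1: float  # right
    y1: float  # bottom


def group_words_into_lines(words: List[Word], min_overlap: float = 0.5) -> List[List[Word]]:
    """Greedily assign each word to the first line it vertically overlaps.

    A word joins a line when the overlap of their vertical extents is at
    least `min_overlap` of the smaller of the two heights (an assumed rule).
    """
    lines: List[List[Word]] = []
    for w in sorted(words, key=lambda w: (w.y0, w.x0)):
        placed = False
        for line in lines:
            top = min(v.y0 for v in line)
            bottom = max(v.y1 for v in line)
            overlap = min(bottom, w.y1) - max(top, w.y0)
            smaller = min(bottom - top, w.y1 - w.y0)
            if smaller > 0 and overlap / smaller >= min_overlap:
                line.append(w)
                placed = True
                break
        if not placed:
            lines.append([w])
    # Order words left to right within a line, and lines top to bottom.
    for line in lines:
        line.sort(key=lambda w: w.x0)
    lines.sort(key=lambda line: min(w.y0 for w in line))
    return lines


def line_bbox(line: List[Word]) -> Tuple[float, float, float, float]:
    """Return the union box (x0, y0, x1, y1) of all words in a line."""
    return (
        min(w.x0 for w in line),
        min(w.y0 for w in line),
        max(w.x1 for w in line),
        max(w.y1 for w in line),
    )


if __name__ == "__main__":
    demo = [
        Word("OCR", 65, 11, 95, 23),
        Word("corpus", 10, 30, 55, 42),
        Word("PubMed", 10, 10, 60, 22),
    ]
    for line in group_words_into_lines(demo):
        print(" ".join(w.text for w in line), line_bbox(line))
```

A greedy overlap rule like this is simple but fragile on multi-column pages, superscripts, and skewed scans, which is one reason line-level boxes produced heuristically should be treated with some caution.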

Code & Models

Datasets

rootsautomation/pubmed-ocr
dataset · 7.0k downloads

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Handwritten Text Recognition Techniques · Cell Image Analysis Techniques