GutenOCR: A Grounded Vision-Language Front-End for Documents
Hunter Heidenreich, Ben Elliott, Olivia Dinica, Yosheb Getachew

TL;DR
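GutenOCR is a family of grounded vision-language OCR models, fine-tuned from Qwen2.5-VL-3B and Qwen2.5-VL-7B, that exposes document reading, detection, and grounding through a unified prompt-based interface. GutenOCR-7B raises the composite grounded OCR score of its backbone from 0.40 to 0.82 on 10.5K held-out business and scientific pages.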
Contribution
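The paper introduces GutenOCR, a family of grounded OCR models obtained by fine-tuning Qwen2.5-VL vision-language models. The models support full-page and localized reading with line- and paragraph-level bounding boxes, plus conditional "where is x?" queries. The paper also introduces a grounded OCR evaluation protocol.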
Findings
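GutenOCR-7B more than doubles the composite grounded OCR score of its Qwen2.5-VL-7B backbone on 10.5K held-out business and scientific pages (0.40 to 0.82).
On Fox and OmniDocBench v1.5, region- and line-level OCR and text-detection recall improve substantially.
Trade-offs appear in page-level linearization, color-guided OCR, and formula-heavy layouts.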
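For readers unfamiliar with text-detection recall, here is a minimal sketch of one common way to compute it: a ground-truth box counts as found when an unused predicted box overlaps it with intersection-over-union (IoU) at or above a threshold. The greedy one-to-one matching, the 0.5 threshold, and the function names `iou` and `detection_recall` are assumptions for illustration. This summary does not state the paper's exact matching rule or threshold.

```python
from typing import List, Tuple

# Axis-aligned box as (x1, y1, x2, y2); this layout is an assumption.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_recall(gt: List[Box], pred: List[Box], threshold: float = 0.5) -> float:
    """Fraction of ground-truth boxes matched by a distinct predicted box.

    Greedy matching: each ground-truth box takes its best-overlapping
    unused prediction, and the match counts if IoU >= threshold.
    """
    if not gt:
        return 1.0  # nothing to find; treated as perfect recall by convention
    unused = list(pred)
    hits = 0
    for g in gt:
        best_idx, best_score = -1, 0.0
        for i, p in enumerate(unused):
            score = iou(g, p)
            if score > best_score:
                best_idx, best_score = i, score
        if best_idx >= 0 and best_score >= threshold:
            hits += 1
            unused.pop(best_idx)  # each prediction can match at most once
    return hits / len(gt)


if __name__ == "__main__":
    truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
    predicted = [(0, 0, 10, 9)]
    print(detection_recall(truth, predicted))
```

Recall measured this way rewards finding every text line but ignores spurious boxes, so it is usually read alongside a precision-style or text-accuracy metric.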
Abstract
GutenOCR is a family of grounded OCR front-ends obtained by fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B. The resulting single-checkpoint vision-language models expose reading, detection, and grounding through a unified, prompt-based interface. Trained on business documents, scientific articles, and synthetic grounding data, the models support full-page and localized reading with line- and paragraph-level bounding boxes and conditional "where is x?" queries. We introduce a grounded OCR evaluation protocol and show that GutenOCR-7B more than doubles the composite grounded OCR score of its Qwen2.5-VL-7B backbone on 10.5K held-out business and scientific pages (0.40 to 0.82). On Fox and OmniDocBench v1.5, our approach substantially improves region- and line-level OCR as well as text-detection recall, but reveals trade-offs in page-level linearization, color-guided OCR, and formula-heavy…
Taxonomy
Topics: Handwritten Text Recognition Techniques · Machine Learning in Materials Science · Topic Modeling
