olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models
Jake Poznanski, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Aman Rangapur, Christopher Wilhelm, Kyle Lo, Luca Soldaini

TL;DR
olmOCR is an open-source toolkit that leverages a fine-tuned vision language model to extract high-quality, structured text from diverse PDFs efficiently and cost-effectively, enabling large-scale language model training.
Contribution
The paper introduces olmOCR, an open-source system that outperforms existing tools and proprietary models at extracting structured content from PDFs at scale.
Findings
olmOCR outperforms top vision language models like GPT-4o and Gemini Flash 2.
The toolkit can process one million PDF pages for only 176 USD.
olmOCR effectively preserves complex structures like tables, formulas, and handwritten text.
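To put the cost finding in per-page terms, here is a minimal arithmetic sketch. It uses only the two figures quoted on this page: 176 USD per million pages for olmOCR and the abstract's "over 6,240 USD" per million pages for GPT-4o. The function and constant names are illustrative assumptions, not part of the olmOCR toolkit, and the GPT-4o figure is a lower bound, so the ratio is a lower bound as well.

```python
# Figures quoted on this page, in USD per one million PDF pages.
OLMOCR_USD_PER_MILLION_PAGES = 176
# The abstract says "over 6,240 USD", so treat this as a lower bound.
GPT4O_USD_PER_MILLION_PAGES = 6240

PAGES_PER_MILLION = 1_000_000


def usd_per_page(usd_per_million_pages: float) -> float:
    """Convert a per-million-pages price into a per-page price."""
    return usd_per_million_pages / PAGES_PER_MILLION


def cost_ratio(baseline_usd: float, candidate_usd: float) -> float:
    """How many times more expensive the baseline is than the candidate."""
    return baseline_usd / candidate_usd


if __name__ == "__main__":
    per_page = usd_per_page(OLMOCR_USD_PER_MILLION_PAGES)
    ratio = cost_ratio(GPT4O_USD_PER_MILLION_PAGES, OLMOCR_USD_PER_MILLION_PAGES)
    print(f"olmOCR: {per_page:.6f} USD per page")
    print(f"GPT-4o is at least {ratio:.1f}x more expensive per page")
```

Under these quoted figures, olmOCR comes to a small fraction of a cent per page, and the GPT-4o price is at least about 35 times higher.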
Abstract
PDF documents have the potential to provide trillions of novel, high-quality tokens for training language models. However, these documents come in a diversity of types with differing formats and visual layouts that pose a challenge when attempting to extract and faithfully represent the underlying content for language model use. Traditional open source tools often produce lower quality extractions compared to vision language models (VLMs), but reliance on the best VLMs can be prohibitively costly (e.g., over 6,240 USD per million PDF pages for GPT-4o) or infeasible if the PDFs cannot be sent to proprietary APIs. We present olmOCR, an open-source toolkit for processing PDFs into clean, linearized plain text in natural reading order while preserving structured content like sections, tables, lists, equations, and more. Our toolkit runs a fine-tuned 7B vision language model (VLM) trained on…
Taxonomy
Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Digital Humanities and Scholarship
