LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

Said Taghadouini; Adrien Cavaillès; Baptiste Aubertin

arXiv:2601.14251 · cs.CV · January 21, 2026

TL;DR

LightOnOCR-2-1B is a compact, 1B-parameter multilingual vision-language model that converts document images into clean, naturally ordered text. It achieves state-of-the-art results on OlmOCR-Bench while being 9× smaller than prior best-performing models, and it can also predict bounding boxes for embedded images.

Contribution

The paper introduces LightOnOCR-2-1B, a 1-billion-parameter end-to-end multilingual model that achieves state-of-the-art results on OlmOCR-Bench, extends its output format to predict normalized bounding boxes for embedded images, and improves robustness through checkpoint averaging and task-arithmetic merging.

Findings

01

Achieves state-of-the-art results on OlmOCR-Bench

02

9× smaller and substantially faster than prior best-performing models

03

Localization of embedded images via normalized bounding boxes, plus robustness gains from checkpoint averaging and task-arithmetic merging

Abstract

We present LightOnOCR-2-1B, a 1B-parameter end-to-end multilingual vision-language model that converts document images (e.g., PDFs) into clean, naturally ordered text without brittle OCR pipelines. Trained on a large-scale, high-quality distillation mix with strong coverage of scans, French documents, and scientific PDFs, LightOnOCR-2 achieves state-of-the-art results on OlmOCR-Bench while being 9× smaller and substantially faster than prior best-performing models. We further extend the output format to predict normalized bounding boxes for embedded images, introducing localization during pretraining via a resume strategy and refining it with RLVR using IoU-based rewards. Finally, we improve robustness with checkpoint averaging and task-arithmetic merging. We release model checkpoints under Apache 2.0, and publicly release the dataset and LightOnOCR-bbox-bench…
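The abstract names three mechanisms without giving formulas: an IoU-based reward over normalized bounding boxes, checkpoint averaging, and task-arithmetic merging. As a rough illustration only, the Python sketch below shows the textbook form of each. The box layout `(x_min, y_min, x_max, y_max)` in [0, 1], the greedy matching inside `iou_reward`, the scaling factor `alpha`, and every function name are assumptions made for this example; they are not details taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max), each normalized to [0, 1].
Box = Tuple[float, float, float, float]
# Toy stand-in for a checkpoint: parameter name -> flat list of weights.
Weights = Dict[str, List[float]]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_reward(pred: Sequence[Box], gold: Sequence[Box]) -> float:
    """Hypothetical reward: greedy one-to-one matching, averaged over the
    larger of the two box counts so missing and extra boxes both cost."""
    if not pred and not gold:
        return 1.0  # nothing to localize and nothing predicted
    unused = list(pred)
    total = 0.0
    for g in gold:
        if not unused:
            break
        best = max(unused, key=lambda p: iou(p, g))
        total += iou(best, g)
        unused.remove(best)
    return total / max(len(pred), len(gold))


def average_checkpoints(ckpts: Sequence[Weights]) -> Weights:
    """Uniform element-wise mean of several checkpoints with identical shapes."""
    n = len(ckpts)
    return {
        name: [sum(c[name][i] for c in ckpts) / n for i in range(len(values))]
        for name, values in ckpts[0].items()
    }


def task_arithmetic_merge(base: Weights, tuned: Weights, alpha: float) -> Weights:
    """Add a scaled task vector (tuned - base) back onto the base weights."""
    return {
        name: [b + alpha * (t - b) for b, t in zip(base[name], tuned[name])]
        for name in base
    }


if __name__ == "__main__":
    gold = [(0.10, 0.10, 0.50, 0.50)]
    pred = [(0.10, 0.10, 0.50, 0.50), (0.60, 0.60, 0.90, 0.90)]
    print(iou_reward(pred, gold))  # exact match plus one spurious box
    print(task_arithmetic_merge({"w": [1.0]}, {"w": [3.0]}, alpha=0.5))
```

In this sketch a spurious predicted box lowers the reward because the sum is divided by the larger box count; a real RLVR setup may score unmatched boxes differently. With `alpha = 1.0` the merge returns the fine-tuned weights unchanged, and with `alpha = 0.0` it returns the base weights, so `alpha` interpolates how much of the fine-tuned behavior is kept.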

Taxonomy

Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis