dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model

Yumeng Li; Guang Yang; Hao Liu; Bowen Wang; Colin Zhang

arXiv:2512.02498 · cs.CV · December 18, 2025

TL;DR

dots.ocr is a unified vision-language model that jointly learns document layout detection, text recognition, and relational understanding, achieving state-of-the-art results across diverse languages and domains.

Contribution

This paper introduces dots.ocr, the first end-to-end model for multilingual document layout parsing that leverages joint training and a large synthetic dataset.

Findings

01 Achieves state-of-the-art performance on OmniDocBench.

02 Demonstrates strong multilingual capabilities across 126 languages.

03 Improves performance on the XDocParse benchmark by approximately 10% (relative).

Abstract

Document Layout Parsing serves as a critical gateway for Artificial Intelligence (AI) to access and interpret the world's vast stores of structured knowledge. This process, which encompasses layout detection, text recognition, and relational understanding, is particularly crucial for empowering next-generation Vision-Language Models. Current methods, however, rely on fragmented, multi-stage pipelines that suffer from error propagation and fail to leverage the synergies of joint training. In this paper, we introduce dots.ocr, a single Vision-Language Model that, for the first time, demonstrates the advantages of jointly learning three core tasks within a unified, end-to-end framework. This is made possible by a highly scalable data engine that synthesizes a vast multilingual corpus, empowering the model to deliver robust performance across a wide array of tasks, encompassing diverse…

Taxonomy

Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Advanced Neural Network Applications