Multimodal OCR: Parse Anything from Documents

Handong Zheng; Yumeng Li; Kaile Zhang; Liang Xin; Guangwei Zhao; Hao Liu; Jiayu Chen; Jie Lou; Qi Fu; Rui Yang; Shuo Jiang; Weijian Luo; Weijie Su; Weijun Zhang; Xingyu Zhu; Yabin Li; Yiwei Ma; Yu Chen; Yuqiu Ji; Zhaohui Yu; Guang Yang; Colin Zhang; Lei Zhang; Yuliang Liu; Xiang Bai

arXiv:2603.13032 · cs.CV · March 20, 2026

Open Access · 2 Models · 1 Dataset

TL;DR

Multimodal OCR (MOCR) is a new document parsing approach that jointly recognizes text and graphics, enabling faithful reconstruction and understanding of complex documents by treating visual elements as first-class targets.

Contribution

The paper introduces MOCR, a unified framework that parses both text and graphics in documents, leveraging multimodal supervision and a comprehensive data engine for improved document understanding.

Findings

01 Achieves state-of-the-art performance on document parsing benchmarks.

02 Outperforms existing open-source systems in structured graphics parsing.

03 Demonstrates strong reconstruction quality across diverse graphical elements.

Abstract

We present Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems that focus on text recognition and leave graphical regions as cropped pixels, our method, termed dots.mocr, treats visual elements such as charts, diagrams, tables, and icons as first-class parsing targets, enabling systems to parse documents while preserving semantic relationships across elements. It offers several advantages: (1) it reconstructs both text and graphics as structured outputs, enabling more faithful document reconstruction; (2) it supports end-to-end training over heterogeneous document elements, allowing models to exploit semantic relations between textual and visual components; and (3) it converts previously discarded graphics into reusable code-level supervision, unlocking multimodal supervision…
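To make the idea of a "unified textual representation" concrete, here is a minimal sketch of how a page containing both text and a graphic might be flattened into one text sequence. Everything in it is an assumption for illustration: the `Element` class, the tag format, the top-to-bottom ordering rule, and the use of SVG as the code-level form of a graphic are hypothetical and are not taken from the dots.mocr output format, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Element:
    """One parsed page element (hypothetical schema, not from the paper)."""
    kind: str                        # e.g. "text", "table", "chart", "icon"
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page pixels
    content: str                     # plain text, or code describing the graphic


def serialize(elements: List[Element]) -> str:
    """Flatten a page into one textual sequence.

    Elements are ordered top-to-bottom, then left-to-right, which is a
    naive stand-in for real reading-order prediction.
    """
    ordered = sorted(elements, key=lambda e: (e.bbox[1], e.bbox[0]))
    return "\n".join(
        f"<{e.kind} bbox={list(e.bbox)}>{e.content}</{e.kind}>" for e in ordered
    )


# Assumed example page: a sentence of text above a one-bar chart.
# The chart is kept as SVG code instead of being left as cropped pixels.
page = [
    Element(
        "chart",
        (40, 120, 300, 260),
        '<svg xmlns="http://www.w3.org/2000/svg" width="260" height="140">'
        '<rect x="10" y="40" width="30" height="100"/></svg>',
    ),
    Element("text", (40, 40, 560, 100), "Quarterly revenue grew 12%."),
]

print(serialize(page))
```

In a representation like this, the graphic becomes text that a model can be trained to emit and that a renderer can turn back into an image, which is one way to read the abstract's point about graphics serving as reusable supervision.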

Code & Models

Models

Datasets

uv-scripts/ocr (dataset, 1.7k downloads)

Taxonomy

Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Natural Language Processing Techniques