MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm

Zhang Li; Yuliang Liu; Qiang Liu; Zhiyin Ma; Ziyang Zhang; Shuo Zhang; Biao Yang; Zidun Guo; Jiarui Zhang; Xinyu Wang; Xiang Bai

arXiv:2506.05218 · cs.CV · February 10, 2026

TL;DR

MonkeyOCR parses documents with a Structure-Recognition-Relation (SRR) triplet paradigm that splits the task into structure detection, content recognition, and relation prediction. Backed by the large, diverse MonkeyDoc dataset and a scalable foundation model, it reports state-of-the-art results.

Contribution

The paper proposes the SRR triplet paradigm for document parsing, introduces the MonkeyDoc dataset with 4.5 million bilingual instances, and develops a scalable, high-performance foundation model that uses parameter-efficiency techniques.

Findings

01 MonkeyOCR surpasses previous state-of-the-art methods.

02 The SRR paradigm effectively simplifies document parsing tasks.

03 Parameter degradation allows scalable model sizes with minimal performance loss.

Abstract

We introduce MonkeyOCR, a document parsing model that advances the state of the art by leveraging a Structure-Recognition-Relation (SRR) triplet paradigm. This design simplifies what would otherwise be a complex multi-tool pipeline and avoids the inefficiencies of processing full pages with giant end-to-end models. In SRR, document parsing is abstracted into three fundamental questions - "Where is it?" (structure), "What is it?" (recognition), and "How is it organized?" (relation) - corresponding to structure detection, content recognition, and relation prediction. To support this paradigm, we present MonkeyDoc, a comprehensive dataset with 4.5 million bilingual instances spanning over ten document types, which addresses the limitations of existing datasets that often focus on a single task, language, or document type. Leveraging the SRR paradigm and MonkeyDoc, we trained a…
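
The three SRR questions in the abstract can be read as three composable stages. The sketch below is only an illustration of that decomposition and is not MonkeyOCR's code or API: the names Block, detect_structure, recognize, predict_order, and parse_document are hypothetical, the "page" is a toy dictionary rather than an image, and reading order is approximated by a naive top-to-bottom sort where the paper uses a learned relation predictor.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# All names here are hypothetical stand-ins, not MonkeyOCR's real interface.
BBox = Tuple[int, int, int, int]      # (x0, y0, x1, y1) in page pixels
Page = Dict[BBox, Tuple[str, str]]    # toy page: bbox -> (block kind, text)


@dataclass
class Block:
    bbox: BBox
    kind: str          # e.g. "title", "text", "table"
    content: str = ""  # filled in by the recognition stage


def detect_structure(page: Page) -> List[Block]:
    """Structure ("Where is it?"): return the regions found on the page."""
    return [Block(bbox=bbox, kind=kind) for bbox, (kind, _) in page.items()]


def recognize(page: Page, block: Block) -> str:
    """Recognition ("What is it?"): read the content of one region."""
    return page[block.bbox][1]


def predict_order(blocks: List[Block]) -> List[int]:
    """Relation ("How is it organized?"): naive top-to-bottom, left-to-right
    ordering, used here as a placeholder for a learned predictor."""
    return sorted(range(len(blocks)),
                  key=lambda i: (blocks[i].bbox[1], blocks[i].bbox[0]))


def parse_document(
    page: Page,
    detect: Callable[[Page], List[Block]] = detect_structure,
    read: Callable[[Page, Block], str] = recognize,
    order: Callable[[List[Block]], List[int]] = predict_order,
) -> List[Block]:
    """Run the three stages in sequence and return blocks in reading order."""
    blocks = detect(page)
    for block in blocks:
        block.content = read(page, block)
    return [blocks[i] for i in order(blocks)]


# Toy input, deliberately listed out of reading order.
toy_page: Page = {
    (50, 300, 550, 400): ("text", "Body paragraph."),
    (50, 40, 550, 90): ("title", "A Toy Document"),
    (50, 120, 550, 260): ("table", "| a | b |"),
}
parsed = parse_document(toy_page)
for block in parsed:
    print(block.kind, "->", block.content)
```

Because each stage is passed in as a plain callable, any one of them can be swapped out without touching the other two, which is the modularity the triplet framing is meant to convey.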

Code & Models

Repositories

yuliang-liu/monkeyocr
paddle · Official

Models

Datasets

inanxr/MonkeyDoc
dataset · 96 downloads

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Handwritten Text Recognition Techniques