MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm
Zhang Li, Yuliang Liu, Qiang Liu, Zhiyin Ma, Ziyang Zhang, Shuo Zhang, Biao Yang, Zidun Guo, Jiarui Zhang, Xinyu Wang, Xiang Bai

TL;DR
MonkeyOCR introduces a document parsing paradigm that breaks the task into structure detection, content recognition, and relation prediction. It is supported by a large, diverse dataset and a scalable foundation model, and the authors report state-of-the-art results.
Contribution
The paper proposes the SRR triplet paradigm for document parsing, introduces the MonkeyDoc dataset with 4.5 million bilingual instances, and develops a scalable, high-performance foundation model with parameter efficiency techniques.
Findings
MonkeyOCR surpasses previous state-of-the-art methods.
The SRR paradigm effectively simplifies document parsing tasks.
Parameter degradation allows scalable model sizes with minimal performance loss.
Abstract
We introduce MonkeyOCR, a document parsing model that advances the state of the art by leveraging a Structure-Recognition-Relation (SRR) triplet paradigm. This design simplifies what would otherwise be a complex multi-tool pipeline and avoids the inefficiencies of processing full pages with giant end-to-end models. In SRR, document parsing is abstracted into three fundamental questions - "Where is it?" (structure), "What is it?" (recognition), and "How is it organized?" (relation) - corresponding to structure detection, content recognition, and relation prediction. To support this paradigm, we present MonkeyDoc, a comprehensive dataset with 4.5 million bilingual instances spanning over ten document types, which addresses the limitations of existing datasets that often focus on a single task, language, or document type. Leveraging the SRR paradigm and MonkeyDoc, we trained a…
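The three SRR questions can be read as three stages of a pipeline. The sketch below is only an illustration of that decomposition, not MonkeyOCR's code or API: `Region`, `parse_page`, and the `toy_*` functions are hypothetical names, the "page" is a hand-written list of boxes rather than an image, and the relation stage uses a simple top-to-bottom, left-to-right sort as a placeholder for a learned relation predictor.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1), hypothetical convention


@dataclass
class Region:
    bbox: BBox
    kind: str          # e.g. "title", "text", "table"
    content: str = ""


def parse_page(
    page: List[Dict],
    detect: Callable[[List[Dict]], List[Region]],
    recognize: Callable[[List[Dict], Region], str],
    order: Callable[[List[Region]], List[int]],
) -> List[Region]:
    """Run the three SRR stages in sequence (illustrative only)."""
    regions = detect(page)                  # structure: "Where is it?"
    for region in regions:
        region.content = recognize(page, region)  # recognition: "What is it?"
    ranking = order(regions)                # relation: "How is it organized?"
    return [regions[i] for i in ranking]


# Toy stand-ins so the sketch runs without any model (all assumptions).
def toy_detect(page: List[Dict]) -> List[Region]:
    return [Region(bbox=b["bbox"], kind=b["kind"]) for b in page]


def toy_recognize(page: List[Dict], region: Region) -> str:
    for b in page:
        if b["bbox"] == region.bbox:
            return b["text"]
    return ""


def toy_order(regions: List[Region]) -> List[int]:
    # Placeholder heuristic: sort by top edge, then left edge.
    return sorted(range(len(regions)),
                  key=lambda i: (regions[i].bbox[1], regions[i].bbox[0]))


page = [
    {"bbox": (0, 100, 200, 150), "kind": "text", "text": "Body paragraph"},
    {"bbox": (0, 0, 200, 40), "kind": "title", "text": "Title"},
]
result = parse_page(page, toy_detect, toy_recognize, toy_order)
print([r.content for r in result])
```

Passing the three stages in as callables keeps each one replaceable on its own, which mirrors the abstract's point that SRR separates detection, recognition, and relation prediction instead of processing the full page with a single end-to-end model.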
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Handwritten Text Recognition Techniques
