MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns

Jiarui Zhang; Yuliang Liu; Zijun Wu; Guosheng Pang; Zhili Ye; Yupei Zhong; Junteng Ma; Tao Wei; Haiyang Xu; Weikai Chen; Zeen Wang; Qiangjun Ji; Fanxi Zhou; Qi Zhang; Yuanrui Hu; Jiahao Liu; Zhang Li; Ziyang Zhang; Qiang Liu; Xiang Bai

arXiv:2511.10390 · cs.CV · November 18, 2025

TL;DR

MonkeyOCR v1.5 is a document parsing system that improves layout understanding and content recognition on complex, real-world documents. It combines a two-stage vision-language framework with reinforcement learning and dedicated modules for embedded images and cross-page tables.

Contribution

The paper introduces MonkeyOCR v1.5, a unified framework with a two-stage pipeline, reinforcement learning for table structure accuracy, and modules for handling embedded images and cross-page tables, advancing document parsing capabilities.

Findings

01 Achieves state-of-the-art performance on OmniDocBench v1.5.

02 Outperforms PPOCR-VL and MinerU 2.5 in robustness and accuracy.

03 Excels in complex document scenarios with multi-level tables and embedded images.

Abstract

Document parsing is a core task in document intelligence, supporting applications such as information extraction, retrieval-augmented generation, and automated document analysis. However, real-world documents often feature complex layouts with multi-level tables, embedded images or formulas, and cross-page structures, which remain challenging for existing OCR systems. We introduce MonkeyOCR v1.5, a unified vision-language framework that enhances both layout understanding and content recognition through a two-stage pipeline. The first stage employs a large multimodal model to jointly predict layout and reading order, leveraging visual information to ensure sequential consistency. The second stage performs localized recognition of text, formulas, and tables within detected regions, maintaining high visual fidelity while reducing error propagation. To address complex table structures, we…
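
The two-stage flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Region` type, the `parse_page` function, the bounding-box convention, and the toy layout model and recognizers are all hypothetical stand-ins introduced here for illustration. The sketch only shows the control flow, where stage 1 returns regions together with a reading order and stage 2 recognizes each region independently, so a mistake in one region does not spread to the others.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1); coordinate convention is assumed
Page = Dict[BBox, str]            # toy stand-in for a page image


@dataclass(frozen=True)
class Region:
    bbox: BBox
    kind: str           # "text", "formula", or "table"
    reading_order: int  # position in the predicted reading sequence


LayoutModel = Callable[[Page], List[Region]]
Recognizer = Callable[[Page, Region], str]


def parse_page(page: Page, layout_model: LayoutModel,
               recognizers: Dict[str, Recognizer]) -> List[str]:
    # Stage 1: predict regions and their reading order in one call.
    regions = layout_model(page)
    # Stage 2: recognize each region independently, in reading order.
    ordered = sorted(regions, key=lambda r: r.reading_order)
    return [recognizers[r.kind](page, r) for r in ordered]


def toy_layout(page: Page) -> List[Region]:
    # Hypothetical stage-1 output, deliberately returned out of order.
    return [
        Region((0, 50, 100, 90), "table", 2),
        Region((0, 0, 100, 20), "text", 0),
        Region((0, 25, 100, 45), "formula", 1),
    ]


def make_recognizer(tag: str) -> Recognizer:
    # Hypothetical stage-2 recognizer: looks up the region content and tags it.
    def recognize(page: Page, region: Region) -> str:
        return f"<{tag}>{page[region.bbox]}</{tag}>"
    return recognize


TOY_RECOGNIZERS = {k: make_recognizer(k) for k in ("text", "formula", "table")}

toy_page: Page = {
    (0, 0, 100, 20): "Title",
    (0, 25, 100, 45): "E = mc^2",
    (0, 50, 100, 90): "a | b",
}

result = parse_page(toy_page, toy_layout, TOY_RECOGNIZERS)
print(result)
```

In a real system the layout model and the recognizers would be model calls on image crops; here they are plain functions so the ordering and dispatch logic can be run and checked in isolation.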

Code & Models

Datasets

inanxr/MonkeyDoc (dataset · 96 downloads)

Taxonomy

Topics: Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques