MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

Bin Wang; Tianyao He; Linke Ouyang; Fan Wu; Zhiyuan Zhao; Tao Chu; Yuan Qu; Zhenjiang Jin; Weijun Zeng; Ziyang Miao; Bangrui Xu; Junbo Niu; Mengzhang Cai; Jiantao Qiu; Qintong Zhang; Dongsheng Ma; Yuefeng Sun; Hejun Dong; Wenzheng Zhang; Jutao Xiao; Jiayong Shi; Pengyu Liao; Xiaomeng Zhao; Huaping Zhong; Liqun Wei; Jing Yu; Jie Yang; Wei Li; Shasha Wang; Qianqian Wu; Xuanhe Zhou; Weijia Li; Zhenxiang Li; Zhongying Tu; Jiang Wu; Lijun Wu; Chao Xu; Kai Chen; Wentao Zhang; Yu Qiao; Bowen Zhou; Dahua Lin; Conghui He

arXiv:2604.04771·cs.CV·April 10, 2026

TL;DR

MinerU2.5-Pro improves document parsing accuracy through data engineering and training strategy alone: it expands the training data and refines annotations while leaving the model architecture unchanged.

Contribution

The paper introduces a data-centric approach that combines new sampling, verification, and refinement techniques to advance the state of the art in document parsing.

Findings

01 Achieves 95.69 on OmniDocBench v1.6, surpassing previous methods.

02 Expands training data from under 10M to 65.5M samples.

03 Improves performance without architectural modifications.

Abstract

Current document parsing methods advance primarily through model architecture innovation, while systematic engineering of training data remains underexplored. Yet state-of-the-art models spanning diverse architectures and parameter scales exhibit highly consistent failure patterns on the same set of hard samples, suggesting that the performance bottleneck stems from shared deficiencies in training data rather than from architectural differences. Building on this finding, we present MinerU2.5-Pro, which advances the state of the art purely through data engineering and training strategy design while retaining the 1.2B-parameter architecture of MinerU2.5 unchanged. At its core is a Data Engine co-designed around coverage, informativeness, and annotation accuracy: Diversity-and-Difficulty-Aware Sampling expands training data from under 10M to 65.5M samples while mitigating distribution…
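
The abstract's starting observation, that different models fail on the same hard samples, can be illustrated with a small overlap analysis. The following is a minimal sketch and not the paper's procedure: the function names, the score dictionary layout, the failure threshold, and the toy data are all assumptions made for illustration. It marks a sample as a failure when a model's score falls below a threshold, then measures how much the failure sets of different models overlap.

```python
from itertools import combinations


def failure_sets(scores, threshold):
    """Map each model name to the set of sample ids scoring below threshold.

    `scores` is a hypothetical layout: {model_name: {sample_id: score}}.
    """
    return {
        model: {sid for sid, s in per_sample.items() if s < threshold}
        for model, per_sample in scores.items()
    }


def jaccard(a, b):
    """Jaccard overlap of two sets; defined here as 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_failure_overlap(scores, threshold):
    """Jaccard overlap of failure sets for every pair of models."""
    fails = failure_sets(scores, threshold)
    return {
        (m1, m2): jaccard(fails[m1], fails[m2])
        for m1, m2 in combinations(sorted(fails), 2)
    }


def shared_hard_samples(scores, threshold):
    """Samples that every model fails on (empty set if no models are given)."""
    fails = failure_sets(scores, threshold)
    return set.intersection(*fails.values()) if fails else set()


# Toy per-sample scores for three hypothetical models (illustrative only).
toy_scores = {
    "model_a": {"p1": 0.95, "p2": 0.40, "p3": 0.55, "p4": 0.91},
    "model_b": {"p1": 0.90, "p2": 0.35, "p3": 0.62, "p4": 0.58},
    "model_c": {"p1": 0.97, "p2": 0.50, "p3": 0.48, "p4": 0.93},
}

hard = shared_hard_samples(toy_scores, threshold=0.7)
overlap = pairwise_failure_overlap(toy_scores, threshold=0.7)
```

With this toy data, `hard` contains the two samples that all three models fail on, and `overlap` gives one Jaccard value per model pair. High pairwise overlap across models with different architectures would be consistent with the data-side bottleneck the abstract describes, although the threshold and metric used here are placeholders.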

Code & Models

Models

Datasets

Lh23593217/Long-he-mineru-models
dataset · 93 dl
