Docopilot: Improving Multimodal Models for Document-Level Understanding

Yuchen Duan; Zhe Chen; Yusong Hu; Weiyun Wang; Shenglong Ye; Botian Shi; Lewei Lu; Qibin Hou; Tong Lu; Hongsheng Li; Jifeng Dai; Wenhai Wang

arXiv:2507.14675 · cs.CV · July 22, 2025

TL;DR

This paper introduces Docopilot, a native multimodal model trained on a new large-scale dataset, Doc-750K, that significantly improves document-level understanding without relying on retrieval-augmented methods.

Contribution

The paper presents a high-quality, diverse dataset for document comprehension and a native multimodal model that effectively captures cross-page dependencies.

Findings

01

Docopilot outperforms existing models in coherence and accuracy.

02

The dataset enables better understanding of complex, multi-page documents.

03

The model achieves efficient, multi-turn document interactions.

Abstract

Despite significant progress in multimodal large language models (MLLMs), their performance on complex, multi-page document comprehension remains inadequate, largely due to the lack of high-quality, document-level datasets. While current retrieval-augmented generation (RAG) methods offer partial solutions, they suffer from issues such as fragmented retrieval contexts, multi-stage error accumulation, and the extra time cost of retrieval. In this work, we present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on the dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG.…

Code & Models

Datasets

OpenGVLab/Doc-750K
dataset · 397 downloads
