DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

Hao Yan; Yuliang Liu; Xingchen Liu; Yuyi Zhang; Minghui Liao; Jihao Wu; Wei Chen; Xiang Bai

arXiv:2604.12812·cs.AI·May 12, 2026

TL;DR

DocSeeker introduces a structured reasoning framework with evidence grounding to improve long document understanding in multimodal models, addressing two challenges: a low signal-to-noise ratio and supervision scarcity.

Contribution

DocSeeker proposes a two-stage training paradigm (Supervised Fine-Tuning followed by Evidence-aware Group Relative Policy Optimization), together with resolution strategies, for better long document comprehension.

Findings

01 Outperforms existing models on both in-domain and out-of-domain tasks.

02 Generalizes effectively from short to ultra-long documents.

03 Complements visual Retrieval-Augmented Generation systems.

Abstract

Existing Multimodal Large Language Models (MLLMs) suffer from significant performance degradation on the long document understanding task as document length increases. This stems from two fundamental challenges: 1) a low Signal-to-Noise Ratio (SNR), with crucial evidence buried in irrelevant pages; and 2) supervision scarcity, as datasets offering only final short answers provide a weak learning signal. In this paper, we address these challenges by proposing a paradigm that requires the model to execute a structured Analysis, Localization and Reasoning workflow. To instill this capability, we design a two-stage training framework: we first perform Supervised Fine-Tuning on high-quality data generated via an efficient knowledge distillation strategy. Subsequently, we employ an Evidence-aware Group Relative Policy Optimization which jointly optimizes for both evidence localization and…
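
To make the second training stage more concrete, the following is a minimal sketch of how a group-relative advantage could be computed from a reward that includes an evidence-localization term. It is not the authors' implementation. The page-level F1 reward, the `task_score` placeholder (standing in for whatever second objective the truncated abstract goes on to name), the `evidence_weight` parameter, and all function names are assumptions made for illustration. Only the group-relative normalization itself (each sampled response's reward minus the group mean, divided by the group standard deviation) is the standard GRPO idea.

```python
from statistics import mean, pstdev


def page_f1(predicted_pages, gold_pages):
    """F1 overlap between predicted and gold evidence pages (hypothetical reward term)."""
    predicted, gold = set(predicted_pages), set(gold_pages)
    hits = len(predicted & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def combined_reward(predicted_pages, gold_pages, task_score, evidence_weight=0.5):
    """Weighted sum of evidence localization and a second, unspecified objective.

    task_score is a placeholder in [0, 1]; the weighting scheme is an assumption.
    """
    localization = page_f1(predicted_pages, gold_pages)
    return evidence_weight * localization + (1.0 - evidence_weight) * task_score


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std would also be a valid choice
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    gold = [4, 7]
    # Three sampled responses: (predicted evidence pages, placeholder task score)
    samples = [([4, 7], 1.0), ([4, 9], 0.0), ([1], 0.0)]
    rewards = [combined_reward(pages, gold, score) for pages, score in samples]
    print(group_relative_advantages(rewards))
```

Under these assumptions, a response that cites the right pages receives a higher advantage than one that cites the wrong pages, even when the second score is identical, which is one way a denser learning signal than final-answer supervision alone could arise.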
