DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
Hao Yan, Yuliang Liu, Xingchen Liu, Yuyi Zhang, Minghui Liao, Jihao Wu, Wei Chen, Xiang Bai

TL;DR
DocSeeker introduces a structured reasoning framework with evidence grounding to improve long document understanding in multimodal models, addressing the low signal-to-noise ratio of long inputs and the scarcity of supervision.
Contribution
It proposes a two-stage training paradigm, Supervised Fine-Tuning on distilled data followed by Evidence-aware Group Relative Policy Optimization, together with resolution strategies for better long document comprehension.
Findings
Outperforms existing models on in-domain and out-of-domain tasks.
Generalizes from short to ultra-long documents effectively.
Synergizes with visual Retrieval-Augmented Generation systems.
Abstract
Existing Multimodal Large Language Models (MLLMs) suffer from significant performance degradation on the long document understanding task as document length increases. This stems from two fundamental challenges: 1) a low Signal-to-Noise Ratio (SNR), with crucial evidence buried in irrelevant pages; and 2) supervision scarcity, as datasets offering only final short answers provide a weak learning signal. In this paper, we address these challenges by proposing a paradigm that requires the model to execute a structured Analysis, Localization and Reasoning workflow. To instill this capability, we design a two-stage training framework: we first perform Supervised Fine-Tuning on high-quality data generated via an efficient knowledge distillation strategy. Subsequently, we employ an Evidence-aware Group Relative Policy Optimization which jointly optimizes for both evidence localization and…
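As a rough illustration of what jointly rewarding evidence localization and the final answer could look like in a group-relative setup, here is a minimal sketch. It is not the authors' implementation: the page-level F1 reward, the `evidence_weight` value, and all function names are assumptions made for this example, since the abstract does not specify the reward. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO recipe.

```python
from statistics import mean, pstdev


def page_f1(predicted_pages, gold_pages):
    """F1 overlap between predicted and gold evidence pages (assumed metric)."""
    pred, ref = set(predicted_pages), set(gold_pages)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def combined_reward(answer_correct, predicted_pages, gold_pages, evidence_weight=0.5):
    """Hypothetical joint reward: answer correctness plus weighted evidence F1."""
    return float(answer_correct) + evidence_weight * page_f1(predicted_pages, gold_pages)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group, as in standard GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One question, three sampled responses: (answer correct?, cited pages).
gold = [12]
samples = [(True, [12]), (True, [3, 12]), (False, [40])]
rewards = [combined_reward(ok, pages, gold) for ok, pages in samples]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, a response that answers correctly and cites only the gold page receives a higher advantage than one that answers correctly but also cites an irrelevant page, which is the kind of pressure toward precise localization the abstract describes.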
