Doc-V*: Coarse-to-Fine Interactive Visual Reasoning for Multi-Page Document VQA

Yuanlei Zheng; Pei Fu; Hang Li; Ziyang Wang; Yuyi Zhang; Wenyu Ruan; Xiaojin Zhang; Zhongyu Wei; Zhenbo Luo; Jian Luan; Wei Chen; Xiang Bai

arXiv:2604.13731 · cs.CL · April 16, 2026

TL;DR

The paper introduces Doc-V*, an OCR-free, interactive framework for multi-page document VQA that actively navigates and aggregates evidence, significantly improving accuracy and efficiency over existing methods.

Contribution

It presents a novel agentic approach that combines semantic retrieval, targeted page fetching, and evidence aggregation for better multi-page document reasoning.
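The combination of overview, retrieval, page fetching, and evidence aggregation can be pictured as a small loop. The following is only a toy sketch under stated assumptions, not the paper's implementation: `WorkingMemory`, `retrieve`, and `run_episode` are hypothetical names, pages are represented as plain strings rather than images, word overlap stands in for semantic retrieval, and a fixed rule replaces the learned policy that would decide each action.

```python
from dataclasses import dataclass, field


@dataclass
class WorkingMemory:
    """Hypothetical structured store: a coarse overview plus per-page notes."""
    overview: list[str] = field(default_factory=list)
    notes: dict[int, str] = field(default_factory=dict)


def retrieve(query: str, pages: list[str], k: int = 2) -> list[int]:
    """Toy stand-in for semantic retrieval: rank pages by word overlap."""
    q = set(query.lower().split())
    scored = [(len(q & set(text.lower().split())), i) for i, text in enumerate(pages)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by page index
    return [i for score, i in scored[:k] if score > 0]


def run_episode(question: str, pages: list[str], max_steps: int = 4) -> WorkingMemory:
    """Coarse overview first, then fetch retrieved pages one at a time."""
    memory = WorkingMemory()
    # Coarse stage: a short prefix of each page plays the role of a thumbnail.
    memory.overview = [p[:20] for p in pages]
    for _ in range(max_steps):
        # Fine stage: retrieve candidates and skip pages already in memory.
        candidates = [i for i in retrieve(question, pages) if i not in memory.notes]
        if not candidates:
            break  # nothing new to look at; stop seeking evidence
        page = candidates[0]
        memory.notes[page] = pages[page]  # targeted fetch, stored as evidence
    return memory
```

In this sketch the episode ends as soon as retrieval surfaces no unseen page, which mirrors the idea of stopping evidence seeking early instead of reading the whole document.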

Findings

01

Outperforms open-source baselines on five benchmarks.

02

Improves out-of-domain performance by up to 47.9%.

03

Achieves effective evidence aggregation through selective attention.

Abstract

Multi-page Document Visual Question Answering requires reasoning over semantics, layouts, and visual elements in long, visually dense documents. Existing OCR-free methods face a trade-off between capacity and precision: end-to-end models scale poorly with document length, while visual retrieval-based pipelines are brittle and passive. We propose Doc-V*, an OCR-free agentic framework that casts multi-page DocVQA as sequential evidence aggregation. Doc-V* begins with a thumbnail overview, then actively navigates via semantic retrieval and targeted page fetching, and aggregates evidence in a structured working memory for grounded reasoning. Trained by imitation learning from expert trajectories and further optimized with Group Relative Policy Optimization, Doc-V* balances answer accuracy with evidence-seeking efficiency. Across five benchmarks, Doc-V* outperforms…
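To make the Group Relative Policy Optimization step more concrete, the sketch below shows the group-relative advantage that gives the method its name: several trajectories are sampled for the same question, and each reward is normalized against the group's mean and standard deviation. The reward shape here is an assumption made for illustration only. `toy_reward` and its `page_cost` penalty are hypothetical and are not taken from the paper, and the use of the population standard deviation is likewise a choice of this sketch.

```python
import statistics


def toy_reward(correct: bool, pages_fetched: int, page_cost: float = 0.05) -> float:
    """Hypothetical reward: answer correctness minus a small cost per fetched page."""
    return (1.0 if correct else 0.0) - page_cost * pages_fetched


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and std of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; eps guards against zero
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled trajectories for one question: (answer correct?, pages fetched).
group = [(True, 2), (True, 6), (False, 3), (False, 1)]
advantages = group_relative_advantages([toy_reward(c, n) for c, n in group])
```

Under these assumptions, a correct trajectory that fetches fewer pages receives a larger advantage than a correct one that fetches more, which is one simple way a reward could trade accuracy against evidence-seeking cost.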
