Look as You Think: Unifying Reasoning and Visual Evidence Attribution for Verifiable Document RAG via Reinforcement Learning
Shuochen Liu, Pengfei Luo, Chao Zhang, Yuhao Chen, Haotian Zhang, Qi Liu, Xin Kou, Tong Xu, Enhong Chen

TL;DR
This paper introduces a reinforcement learning framework called Look As You Think (LAT) that trains vision-language models to generate verifiable, evidence-grounded reasoning paths for visual document question answering, improving accuracy and traceability.
Contribution
The paper proposes the Chain-of-Evidence (CoE) paradigm and the LAT framework, which unify reasoning and visual evidence attribution with process-level self-verification in visual document retrieval-augmented generation (VD-RAG).
Findings
LAT improves model performance by 8.23% in soft EM and 47.0% in IoU@0.5.
LAT outperforms the supervised fine-tuning baseline in accuracy and generalization.
Experiments demonstrate LAT's effectiveness on Paper- and Wiki-VISA benchmarks.
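For readers unfamiliar with box-overlap scoring, the attribution figure above can be read as an intersection-over-union (IoU) score at a 0.5 threshold. The following is a minimal sketch of that general metric, not the paper's evaluation code; the function names `iou` and `hit_rate_at`, the `(x1, y1, x2, y2)` box layout, and the one-to-one pairing of predicted and gold boxes are all assumptions made for illustration.

```python
from typing import Sequence

# Assumed box layout: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Sequence[float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def hit_rate_at(preds: Sequence[Box], golds: Sequence[Box],
                threshold: float = 0.5) -> float:
    """Fraction of paired boxes whose IoU reaches the threshold.

    Hypothetical helper: assumes preds[i] is scored against golds[i].
    """
    if not golds:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, golds))
    return hits / len(golds)
```

Under this reading, a predicted box counts as correct only when it overlaps the annotated evidence region by at least half of their combined area.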
Abstract
Aiming to identify precise evidence sources from visual documents, visual evidence attribution for visual document retrieval-augmented generation (VD-RAG) ensures reliable and verifiable predictions from vision-language models (VLMs) in multimodal question answering. Most existing methods adopt end-to-end training to facilitate intuitive answer verification. However, they lack fine-grained supervision and progressive traceability throughout the reasoning process. In this paper, we introduce the Chain-of-Evidence (CoE) paradigm for VD-RAG. CoE unifies Chain-of-Thought (CoT) reasoning and visual evidence attribution by grounding reference elements in reasoning steps to specific regions with bounding boxes and page indexes. To enable VLMs to generate such evidence-grounded reasoning, we propose Look As You Think (LAT), a reinforcement learning framework that trains models to produce…
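To make the idea of evidence-grounded reasoning concrete, the sketch below shows one way a reasoning step could carry a page index and a bounding box for each element it references. This is an illustrative data structure only; the class names `Evidence` and `ReasoningStep`, the `render` helper, and the text layout it produces are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Evidence:
    """A referenced region: page index plus (x1, y1, x2, y2) box (assumed layout)."""
    page: int
    bbox: Tuple[int, int, int, int]


@dataclass
class ReasoningStep:
    """One reasoning step together with the regions that support it."""
    text: str
    evidence: List[Evidence]


def render(steps: List[ReasoningStep], answer: str) -> str:
    """Format steps so each claim is followed by its supporting regions."""
    lines = []
    for i, step in enumerate(steps, 1):
        refs = "; ".join(f"page {e.page}, box {e.bbox}" for e in step.evidence)
        lines.append(f"Step {i}: {step.text} [{refs}]")
    lines.append(f"Answer: {answer}")
    return "\n".join(lines)


if __name__ == "__main__":
    steps = [ReasoningStep("The table lists the value.",
                           [Evidence(page=2, bbox=(10, 20, 110, 60))])]
    print(render(steps, "42"))
```

Keeping the regions attached to individual steps, rather than to the final answer alone, is what lets a reader check each intermediate claim against the document.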
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling
