Look as You Think: Unifying Reasoning and Visual Evidence Attribution for Verifiable Document RAG via Reinforcement Learning

Shuochen Liu; Pengfei Luo; Chao Zhang; Yuhao Chen; Haotian Zhang; Qi Liu; Xin Kou; Tong Xu; Enhong Chen

arXiv:2511.12003 · cs.AI · December 2, 2025



TL;DR

This paper introduces a reinforcement learning framework called Look As You Think (LAT) that trains vision-language models to generate verifiable, evidence-grounded reasoning paths for visual document question answering, improving accuracy and traceability.

Contribution

The paper proposes the Chain-of-Evidence (CoE) paradigm and the LAT framework, which unify reasoning and visual evidence attribution with process-level self-verification in visual document retrieval-augmented generation (VD-RAG).

Findings

01

LAT improves model performance by 8.23% in soft EM and 47.0% in IoU@0.5.

02

LAT outperforms the supervised fine-tuning baseline in accuracy and generalization.

03

Experiments demonstrate LAT's effectiveness on the Paper-VISA and Wiki-VISA benchmarks.

Abstract

Aiming to identify precise evidence sources from visual documents, visual evidence attribution for visual document retrieval-augmented generation (VD-RAG) ensures reliable and verifiable predictions from vision-language models (VLMs) in multimodal question answering. Most existing methods adopt end-to-end training to facilitate intuitive answer verification. However, they lack fine-grained supervision and progressive traceability throughout the reasoning process. In this paper, we introduce the Chain-of-Evidence (CoE) paradigm for VD-RAG. CoE unifies Chain-of-Thought (CoT) reasoning and visual evidence attribution by grounding reference elements in reasoning steps to specific regions with bounding boxes and page indexes. To enable VLMs to generate such evidence-grounded reasoning, we propose Look As You Think (LAT), a reinforcement learning framework that trains models to produce…
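To make the idea of evidence-grounded reasoning concrete, here is a minimal Python sketch of what one grounded reasoning step might look like: a piece of reasoning text tied to a page index and a bounding box, as the abstract describes. The `EvidenceStep` schema, its field names, and the `evidence_matches` helper are hypothetical illustrations, not the paper's actual output format or evaluation code. Box overlap is scored with intersection-over-union (IoU), a standard measure for comparing bounding boxes; the 0.5 threshold is an assumed default.

```python
from dataclasses import dataclass
from typing import Tuple

# (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2; the coordinate convention is an assumption.
Box = Tuple[float, float, float, float]


@dataclass
class EvidenceStep:
    """One reasoning step grounded to a document region (hypothetical schema)."""
    thought: str      # reasoning text for this step
    page_index: int   # index of the page that holds the evidence
    bbox: Box         # region on that page supporting the step


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def evidence_matches(pred: EvidenceStep, ref: EvidenceStep,
                     threshold: float = 0.5) -> bool:
    """Count a predicted step as a hit if the page matches and IoU >= threshold."""
    return pred.page_index == ref.page_index and iou(pred.bbox, ref.bbox) >= threshold


if __name__ == "__main__":
    pred = EvidenceStep("The table reports the main result.", 2, (10, 10, 110, 60))
    ref = EvidenceStep("", 2, (20, 10, 120, 60))
    print(round(iou(pred.bbox, ref.bbox), 3), evidence_matches(pred, ref))
```

Requiring the page index to match before comparing boxes reflects the multi-page setting: a box with perfect overlap on the wrong page should not count as correct attribution.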


Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling