CAVE: A Structured Credit Assignment Approach for Fragmented Visual Evidence Reasoning

Tengda Guo; Jie Leng; Hanlei Li; Yaoyuan Liang; Qingyue Zhang; Dian Yang; Mingyu Zhang; Yuhua Fu; Shao-Lun Huang

arXiv:2605.16416 · cs.CV · May 19, 2026

TL;DR

CAVE introduces a structured credit assignment method for visual reasoning that enhances the integration of fragmented visual evidence in vision-language models, leading to improved performance and robustness.

Contribution

The paper proposes CAVE, a novel process-reward approach based on GRPO, and introduces TRACER-Bench for evaluating nonlocal visual reasoning, advancing the state of multimodal reasoning.

Findings

01 CAVE significantly improves performance on tasks that require reasoning over fragmented visual evidence.

02 CAVE is more robust when cross-region dependencies become longer-range and deeper.

03 In experiments, CAVE outperforms existing methods on TRACER-Bench and on public benchmarks.

Abstract

Vision-Language Models (VLMs) have achieved strong performance on general multimodal reasoning, yet remain challenged in integrating nonlocal visual information to support semantically underdetermined visual reasoning. We describe this challenge as Fragmented Visual Reasoning. To this end, we propose Credit Assignment for Visual Evidence (CAVE), a structured process-reward method based on GRPO for interleaved visual reasoning. Specifically, CAVE evaluates the contribution of intermediate steps at the action level via three complementary reasoning process signals: belief update, evidence acquisition, and adaptive focus control, thereby guiding the model to optimize each reasoning action and learn more reliable visual reasoning strategies. Meanwhile, we construct TRACER-Bench, which covers four nonlocal and semantically confusable reasoning dimensions and provides key intermediate…
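As a rough illustration of how step-level process signals could feed a GRPO-style update, the sketch below scores each reasoning action with three signals named after those in the abstract, folds them into one reward per rollout, and normalizes rewards within a sampled group. The abstract does not say how the signals are computed or combined, so the weighted sum, the `weights` parameter, the averaging over steps, and every function name here are assumptions for illustration, not the authors' formulation. It also collapses credit to the rollout level, whereas the abstract describes action-level credit. The only part taken from GRPO itself is normalizing each rollout's reward by the mean and standard deviation of its group.

```python
from statistics import mean, pstdev


def step_reward(belief_update, evidence_acquisition, focus_control,
                weights=(1.0, 1.0, 1.0)):
    """Combine three per-action signals into one scalar (assumed weighted sum)."""
    w_b, w_e, w_f = weights
    return w_b * belief_update + w_e * evidence_acquisition + w_f * focus_control


def trajectory_reward(step_signals, outcome_reward, weights=(1.0, 1.0, 1.0)):
    """Outcome reward plus the mean process reward over all steps (assumption)."""
    if not step_signals:
        return outcome_reward
    process = mean(step_reward(*s, weights=weights) for s in step_signals)
    return outcome_reward + process


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical group of three rollouts for the same question.
    # Each step is (belief_update, evidence_acquisition, focus_control).
    group = [
        ([(0.2, 1.0, 0.5), (0.6, 1.0, 0.5)], 1.0),  # correct, well grounded
        ([(0.0, 0.0, 0.5)], 0.0),                   # wrong, no evidence found
        ([(0.1, 1.0, 0.0)], 1.0),                   # correct, weaker process
    ]
    rewards = [trajectory_reward(steps, outcome) for steps, outcome in group]
    print(group_relative_advantages(rewards))
```

Because advantages are computed relative to the group, two rollouts with the same final answer can still receive different credit when their intermediate steps score differently, which is the general motivation for adding process rewards on top of an outcome reward.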
