CAVE: A Structured Credit Assignment Approach for Fragmented Visual Evidence Reasoning
Tengda Guo, Jie Leng, Hanlei Li, Yaoyuan Liang, Qingyue Zhang, Dian Yang, Mingyu Zhang, Yuhua Fu, Shao-Lun Huang

TL;DR
CAVE introduces a structured credit assignment method for visual reasoning that enhances the integration of fragmented visual evidence in vision-language models, leading to improved performance and robustness.
Contribution
The paper proposes CAVE, a novel process-reward approach based on GRPO, and introduces TRACER-Bench for evaluating nonlocal visual reasoning, advancing the state of multimodal reasoning.
Findings
CAVE significantly improves performance on fragmented visual evidence tasks.
CAVE enhances robustness in longer-range and deeper cross-region dependencies.
Experiments show CAVE outperforms existing methods on TRACER-Bench and public benchmarks.
Abstract
Vision-Language Models (VLMs) have achieved strong performance on general multimodal reasoning, yet remain challenged in integrating nonlocal visual information to support semantically underdetermined visual reasoning. We describe this challenge as Fragmented Visual Reasoning. To this end, we propose Credit Assignment for Visual Evidence (CAVE), a structured process-reward method based on GRPO for interleaved visual reasoning. Specifically, CAVE evaluates the contribution of intermediate steps at the action level via three complementary reasoning process signals: belief update, evidence acquisition, and adaptive focus control, thereby guiding the model to optimize each reasoning action and learn more reliable visual reasoning strategies. Meanwhile, we construct TRACER-Bench, which covers four nonlocal and semantically confusable reasoning dimensions and provides key intermediate…
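The abstract names GRPO and three action-level signals, but the excerpt gives no formulas for them. The following is a minimal sketch of how such a scheme could be wired together, not the paper's definition. The function names, the weighted-sum combination, the weights, and the way the process reward is added to the outcome reward are all illustrative assumptions. Only the group-relative standardization of rewards reflects how GRPO is commonly described.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group, GRPO-style.

    Each response's advantage is its reward minus the group mean,
    divided by the group standard deviation. Population std is used
    here; some implementations use the sample std instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def action_reward(belief_update, evidence_acquisition, focus_control,
                  weights=(1.0, 1.0, 1.0)):
    """Hypothetical per-action process reward.

    Assumes the three signals named in the abstract are already scalar
    scores and combines them with a weighted sum (an assumption).
    """
    w_b, w_e, w_f = weights
    return (w_b * belief_update
            + w_e * evidence_acquisition
            + w_f * focus_control)


def trajectory_reward(action_signals, outcome_reward, process_weight=0.5):
    """Hypothetical trajectory reward: outcome plus mean process reward.

    action_signals is a list of (belief, evidence, focus) tuples,
    one tuple per reasoning action in the trajectory.
    """
    if not action_signals:
        return outcome_reward
    process = mean(action_reward(*s) for s in action_signals)
    return outcome_reward + process_weight * process


# Usage: three sampled trajectories for one prompt (made-up scores).
group = [
    trajectory_reward([(0.2, 0.1, 0.0), (0.4, 0.3, 0.1)], outcome_reward=1.0),
    trajectory_reward([(0.0, 0.1, 0.0)], outcome_reward=0.0),
    trajectory_reward([(0.3, 0.2, 0.2)], outcome_reward=1.0),
]
advantages = group_relative_advantages(group)
```

Standardizing within the group means a trajectory is rewarded only relative to its siblings sampled for the same prompt, so a group with identical rewards yields zero advantage for every member.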
