Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology

Haochen Wang; Xiangtai Li; Zilong Huang; Anran Wang; Jiacong Wang; Tao Zhang; Jiani Zheng; Sule Bai; Zijian Kang; Jiashi Feng; Zhuochen Wang; Zhaoxiang Zhang

arXiv:2507.07999 · cs.CV · March 6, 2026

TL;DR

This paper introduces TreeBench, a diagnostic benchmark for evaluating visual grounded reasoning, and TreeVGR, a training method that improves reasoning accuracy and traceability in vision-language models.

Contribution

The paper presents a new benchmark, TreeBench, for holistic evaluation of visual reasoning, and a training paradigm, TreeVGR, that improves reasoning accuracy and traceability in models.

Findings

01. Models struggle with TreeBench, with none exceeding 60% accuracy.

02. TreeVGR improves performance on multiple benchmarks, including TreeBench.

03. Traceability enhances reasoning accuracy and model interpretability.

Abstract

Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, just like human "thinking with images". However, no benchmark exists to evaluate these capabilities holistically. To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a diagnostic benchmark built on three principles: (1) focused visual perception of subtle targets in complex scenes, (2) traceable evidence via bounding box evaluation, and (3) second-order reasoning to test object interactions and spatial hierarchies beyond simple object localization. Prioritizing images with dense objects, we initially sample 1K high-quality images from SA-1B, and incorporate eight LMM experts to manually annotate questions, candidate options, and answers for each image. After three stages of quality control, TreeBench consists of 405 challenging visual question-answering…
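The second principle above, traceable evidence via bounding box evaluation, implies comparing the boxes a model predicts with annotated ones. A minimal sketch of one common way to score such a comparison, intersection-over-union (IoU), is given below. The visible portion of the abstract does not name the metric, so IoU, the `(x_min, y_min, x_max, y_max)` box format, and the `box_iou` helper are illustrative assumptions rather than the authors' evaluation code.

```python
from typing import Tuple

# Assumed box format (not specified in the abstract): (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def box_iou(pred: Box, target: Box) -> float:
    """Intersection-over-union between a predicted and an annotated box.

    Hypothetical helper for illustration only.
    """
    # Corners of the overlapping rectangle, if any.
    ix_min = max(pred[0], target[0])
    iy_min = max(pred[1], target[1])
    ix_max = min(pred[2], target[2])
    iy_max = min(pred[3], target[3])

    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    union = box_area(pred) + box_area(target) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two 2x2 boxes offset by one unit overlap in a 1x1 square.
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

A score near 1 would indicate that the region the model cites as evidence closely matches the annotated target, while a score of 0 would mean the cited region misses it entirely.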

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis