Beyond Accuracy: Evaluating Grounded Visual Evidence in Thinking with Images
Xuchen Li, Xuzhao Li, Renjie Pi, Shiyu Hu, Jian Zhao, Jiahui Gao

TL;DR
This paper introduces ViEBench, a benchmark that evaluates visual reasoning in vision-language models (VLMs) on faithfulness and explainability rather than accuracy alone, using detailed diagnostics and expert-annotated visual evidence.
Contribution
It presents ViEBench, a process-verifiable benchmark with multi-scenario images and a dual-axis evaluation matrix to assess visual reasoning and grounding fidelity in VLMs.
Findings
VLMs sometimes produce correct answers without grounding them in the relevant visual evidence.
Models can locate the relevant evidence yet still fail to use it to reach the correct answer.
ViEBench enables transparent diagnosis of model reasoning behaviors.
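The two failure modes above can be illustrated by crossing answer correctness with evidence grounding. The following is a minimal sketch, not the paper's actual metric: the box format, the IoU criterion, the 0.5 threshold, and the names `iou` and `quadrant` are all assumptions made for illustration.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def quadrant(answer_correct, pred_box, gold_box, threshold=0.5):
    """Place one sample in a 2x2 grid: answer correctness x evidence grounding.

    Hypothetical rule: a prediction counts as grounded when the model's
    evidence box overlaps the annotated box with IoU >= threshold.
    """
    grounded = pred_box is not None and iou(pred_box, gold_box) >= threshold
    if answer_correct and grounded:
        return "correct_grounded"
    if answer_correct:
        return "correct_ungrounded"    # right answer, no relevant evidence
    if grounded:
        return "incorrect_grounded"    # evidence located but not used well
    return "incorrect_ungrounded"


gold = (0, 0, 10, 10)
print(quadrant(True, None, gold))    # correct answer, no evidence box given
print(quadrant(False, gold, gold))   # evidence located, answer still wrong
```

Counting samples per quadrant separates the first finding (correct but ungrounded) from the second (grounded but incorrect), which a single accuracy number cannot do.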
Abstract
Despite the remarkable progress of Vision-Language Models (VLMs) in adopting "Thinking-with-Images" capabilities, accurately evaluating the authenticity of their reasoning process remains a critical challenge. Existing benchmarks mainly rely on outcome-oriented accuracy, lacking the capability to assess whether models can accurately leverage fine-grained visual cues for multi-step reasoning. To address these limitations, we propose ViEBench, a process-verifiable benchmark designed to evaluate faithful visual reasoning. Comprising 200 multi-scenario high-resolution images with expert-annotated visual evidence, ViEBench uniquely categorizes tasks by difficulty into perception and reasoning dimensions, where reasoning tasks require utilizing localized visual details with prior knowledge. To establish comprehensive evaluation criteria, we introduce a dual-axis matrix that provides…
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Child and Animal Learning Development
