Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking
Rheeya Uppaal, Phu Mon Htut, Min Bai, Nikolaos Pappas, Zheng Qi, Sandesh Swamy

TL;DR
This paper argues that reasoning chains generated by vision-language models should be evaluated for visual faithfulness, not just final-answer accuracy. It proposes a new evaluation metric and a self-reflection method to improve the reliability of multimodal reasoning.
Contribution
It introduces a novel metric for assessing visual faithfulness in reasoning chains and a self-reflection technique to detect and regenerate unfaithful perception steps without training.
Findings
Reduces unfaithful perception rate in reasoning chains
Maintains final-answer accuracy while improving faithfulness
Enhances reliability of multimodal reasoning models
Abstract
Reasoning-augmented vision language models (VLMs) generate explicit chains of thought that promise greater capability and transparency but also introduce new failure modes: models may reach correct answers via visually unfaithful intermediate steps, or reason faithfully yet fail on the final prediction. Standard evaluations that only measure final-answer accuracy cannot distinguish these behaviors. We introduce the visual faithfulness of reasoning chains as a distinct evaluation dimension, focusing on whether the perception steps of a reasoning chain are grounded in the image. We propose a training- and reference-free framework that decomposes chains into perception versus reasoning steps and uses off-the-shelf VLM judges for step-level faithfulness, additionally verifying this approach through a human meta-evaluation. Building on this metric, we present a lightweight self-reflection…
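As a rough illustration of the step-level metric described in the abstract, the sketch below computes an unfaithful perception rate over a chain whose steps are already split into perception and reasoning. It is not the authors' implementation: the `Step` class, the `kind` labels, and the `judge` callable (standing in for an off-the-shelf VLM judge that sees the image) are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    text: str
    kind: str  # hypothetical labels: "perception" or "reasoning"


def unfaithful_perception_rate(
    steps: List[Step], judge: Callable[[str], bool]
) -> float:
    """Fraction of perception steps the judge marks as not grounded.

    `judge` is an assumed stand-in for a VLM judge: it returns True
    when a perception statement is faithful to the image.
    """
    perception = [s for s in steps if s.kind == "perception"]
    if not perception:
        # No perception steps to score; report 0.0 by convention here.
        return 0.0
    unfaithful = sum(1 for s in perception if not judge(s.text))
    return unfaithful / len(perception)


if __name__ == "__main__":
    chain = [
        Step("The car in the image is red.", "perception"),
        Step("Two people stand next to the car.", "perception"),
        Step("Therefore the answer is B.", "reasoning"),
    ]
    # Toy judge for demonstration only: flags any statement mentioning "red".
    toy_judge = lambda text: "red" not in text
    print(unfaithful_perception_rate(chain, toy_judge))
```

Only perception steps enter the denominator, so a chain can score well on this rate and still reach a wrong final answer, which is the distinction the abstract draws between faithfulness and accuracy.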
Taxonomy
Topics: Multimodal Machine Learning Applications · Embodied and Extended Cognition · Child and Animal Learning Development
