SPD-Faith Bench: Diagnosing and Improving Faithfulness in Chain-of-Thought for Multimodal Large Language Models

Weijiang Lv; Yaoxuan Feng; Xiaobo Xia; Jiayu Wang; Yan Jing; Wenchao Chen; Bo Chen

arXiv:2602.07833·cs.CV·February 10, 2026

Open Access

TL;DR

This paper introduces SPD-Faith Bench, a diagnostic benchmark for assessing the faithfulness of chain-of-thought reasoning in multimodal large language models. Evaluations on it reveal two systematic failure modes, and the paper proposes SAGE, a train-free calibration framework that aligns reasoning with visual perception.

Contribution

The paper contributes a benchmark for evaluating reasoning faithfulness and a train-free calibration method, SAGE, that improves visual routing and aligns reasoning with perception in multimodal models.

Findings

01

Identifies perceptual blindness and perception-reasoning dissociation as key failure modes.

02

Traces these failures to decaying visual attention and representation shifts in the residual stream.

03

Proposes SAGE, a calibration framework that improves visual reasoning alignment.
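
To make the "decaying visual attention" finding concrete, the following is a minimal sketch of one way such decay could be quantified. It is not the paper's measurement procedure. It assumes each generated token has an attention row over a fixed-length context and that a boolean mask marks which context positions are image tokens. The function names `visual_attention_share` and `linear_slope`, and the toy numbers, are hypothetical.

```python
from typing import List, Sequence


def visual_attention_share(
    attn_rows: Sequence[Sequence[float]],
    is_visual: Sequence[bool],
) -> List[float]:
    """Return, per generated token, the fraction of attention mass on image tokens.

    attn_rows[t][j] is the (assumed, already head-averaged) attention weight
    from generated token t to context position j.
    """
    shares = []
    for row in attn_rows:
        total = sum(row)
        visual = sum(w for w, v in zip(row, is_visual) if v)
        shares.append(visual / total if total > 0 else 0.0)
    return shares


def linear_slope(values: Sequence[float]) -> float:
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(values))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


# Toy data (made up): positions 0-1 are image tokens, 2-3 are text tokens.
is_visual = [True, True, False, False]
attn_rows = [
    [0.40, 0.30, 0.20, 0.10],  # early reasoning step
    [0.25, 0.25, 0.30, 0.20],  # middle step
    [0.10, 0.10, 0.50, 0.30],  # late step
]

shares = visual_attention_share(attn_rows, is_visual)
slope = linear_slope(shares)
print("visual attention share per step:", shares)
print("slope across steps:", slope)
```

A negative slope on real attention maps would be consistent with the decay described above, though which layers and heads to average over is a design choice this sketch leaves open.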

Abstract

Chain-of-Thought reasoning is widely used to improve the interpretability of multimodal large language models (MLLMs), yet the faithfulness of the generated reasoning traces remains unclear. Prior work has mainly focused on perceptual hallucinations, leaving reasoning-level unfaithfulness underexplored. To isolate faithfulness from linguistic priors, we introduce SPD-Faith Bench, a diagnostic benchmark based on fine-grained image difference reasoning that enforces explicit visual comparison. Evaluations on state-of-the-art MLLMs reveal two systematic failure modes: perceptual blindness and perception-reasoning dissociation. We trace these failures to decaying visual attention and representation shifts in the residual stream. Guided by this analysis, we propose SAGE, a train-free visual evidence-calibrated framework that improves visual routing and aligns reasoning with perception. Our…

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis