On the Faithfulness of Visual Thinking: Measurement and Enhancement

Zujing Liu; Junwen Pan; Qi She; Yuan Gao; Guisong Xia

arXiv:2510.23482·cs.CV·October 28, 2025

TL;DR

This paper investigates the faithfulness of visual reasoning in large vision-language models (LVLMs). It shows that the visual information in their multimodal chain-of-thought (MCoT) traces is often inaccurate even when the final answer is correct, and proposes a learning strategy, SCCM, to make that visual information more reliable.

Contribution

It introduces an automated evaluation metric for visual faithfulness and proposes SCCM learning, a plug-and-play method that enhances visual faithfulness without requiring annotations.

Findings

01 Visual information in current MCoT traces is unreliable and insufficient.

02 SCCM learning improves the faithfulness of visual reasoning in models.

03 Models with SCCM outperform baselines on perception and reasoning benchmarks.

Abstract

Recent large vision-language models (LVLMs) can generate vision-text multimodal chain-of-thought (MCoT) traces after reinforcement fine-tuning (RFT). However, we observe that the visual information incorporated in MCoT is often inaccurate even though the model still yields correct answers, indicating a lack of faithfulness in the MCoT reasoning process. We attribute this unfaithfulness to the RL reward in RFT, which solely incentivizes the format of interleaved vision-text cues, i.e., it encourages the model to incorporate visual information into its text reasoning steps without considering the correctness of the visual information. In this paper, we first probe the faithfulness of MCoT by measuring how much the prediction changes when its visual and textual thoughts are intervened on. Surprisingly, the model's predictions remain nearly unchanged under visual intervention but change significantly under…
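The intervention probe described in the abstract can be illustrated with a small sketch. The abstract is truncated here and does not give the paper's exact metric, so the function below is only an assumption about one simple way to quantify "how much the prediction changes": the fraction of examples whose final answer differs after an intervention. The function name `prediction_change_rate` and the toy answers are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def prediction_change_rate(original: Sequence[str], intervened: Sequence[str]) -> float:
    """Fraction of examples whose final answer differs after an intervention.

    Hypothetical helper: a simple stand-in for a faithfulness probe,
    not the metric defined in the paper.
    """
    if len(original) != len(intervened):
        raise ValueError("answer lists must have the same length")
    if not original:
        raise ValueError("need at least one example")
    changed = sum(a != b for a, b in zip(original, intervened))
    return changed / len(original)


# Toy, made-up answers for four questions (illustration only).
baseline = ["A", "C", "B", "D"]       # answers with the unmodified MCoT trace
after_visual = ["A", "C", "B", "D"]   # answers after perturbing the visual thoughts
after_textual = ["B", "C", "A", "A"]  # answers after perturbing the textual thoughts

visual_rate = prediction_change_rate(baseline, after_visual)
textual_rate = prediction_change_rate(baseline, after_textual)
print(f"visual intervention change rate:  {visual_rate:.2f}")
print(f"textual intervention change rate: {textual_rate:.2f}")
```

Under this reading, a change rate near zero for visual interventions alongside a high rate for textual interventions would suggest that the answer depends mostly on the textual thoughts, which matches the pattern the abstract reports.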
