On the Faithfulness of Visual Thinking: Measurement and Enhancement
Zujing Liu, Junwen Pan, Qi She, Yuan Gao, Guisong Xia

TL;DR
This paper investigates the faithfulness of visual reasoning in large vision-language models, revealing current issues and proposing a novel learning strategy to improve the accuracy and reliability of visual information in multimodal reasoning.
Contribution
It introduces an automated evaluation metric for visual faithfulness and proposes SCCM learning, a plug-and-play method that enhances visual faithfulness without requiring annotations.
Findings
Visual information in current MCoT traces is unreliable and insufficient.
SCCM learning improves the faithfulness of visual reasoning in models.
Models with SCCM outperform baselines on perception and reasoning benchmarks.
Abstract
Recent large vision-language models (LVLMs) can generate vision-text multimodal chain-of-thought (MCoT) traces after reinforcement fine-tuning (RFT). However, we observe that the visual information incorporated in MCoT is often inaccurate even though the traces still yield correct answers, indicating a lack of faithfulness in the MCoT reasoning process. We attribute this unfaithfulness to the RL reward in RFT, which solely incentivizes the format of interleaved vision-text cues, i.e., it encourages the model to incorporate visual information into its text reasoning steps without considering the correctness of that visual information. In this paper, we first probe the faithfulness of MCoT by measuring how much the prediction changes when its visual and textual thoughts are intervened on. Surprisingly, the model's predictions remain nearly unchanged under visual intervention but change significantly under…
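The intervention probe described in the abstract can be illustrated with a small sketch. This is not the paper's metric: it assumes, purely for illustration, that faithfulness is summarized as the fraction of examples whose final answer changes after an intervention. The function name `prediction_change_rate` and the toy answer lists are hypothetical.

```python
from typing import Sequence


def prediction_change_rate(original: Sequence[str], intervened: Sequence[str]) -> float:
    """Fraction of examples whose final answer differs after an intervention.

    Assumption: answers are compared by exact string match. The paper's
    actual measurement may differ.
    """
    if len(original) != len(intervened):
        raise ValueError("prediction lists must have the same length")
    if not original:
        raise ValueError("need at least one prediction")
    changed = sum(a != b for a, b in zip(original, intervened))
    return changed / len(original)


# Hypothetical toy answers, not results from the paper.
original = ["A", "B", "C", "D"]
after_visual = ["A", "B", "C", "D"]    # answers after perturbing visual thoughts
after_textual = ["B", "B", "A", "C"]   # answers after perturbing textual thoughts

visual_rate = prediction_change_rate(original, after_visual)
textual_rate = prediction_change_rate(original, after_textual)
print(visual_rate, textual_rate)
```

Under this reading, a low change rate for visual interventions alongside a high rate for textual ones would suggest that the answer depends little on the visual thoughts, which is the pattern the abstract reports.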
