Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models
Danae Sánchez Villegas, Samuel Lewis-Lim, Nikolaos Aletras, Desmond Elliott

TL;DR
This paper investigates how vision-language models reason and rely on visual versus textual cues, revealing limitations in current interpretability methods like Chain-of-Thought for understanding modality influence.
Contribution
It provides a comprehensive analysis of reasoning dynamics in 18 VLMs, highlighting the partial transparency of Chain-of-Thought in revealing modality reliance and influence.
Findings
Models exhibit answer inertia, reinforcing early predictions.
Reasoning-trained models show stronger corrective behavior, but their gains depend on modality conditions.
Models are influenced by textual cues even when visual evidence is sufficient.
Abstract
Recent advances in vision-language models (VLMs) offer reasoning capabilities, yet how these unfold and integrate visual and textual information remains unclear. We analyze reasoning dynamics in 18 VLMs covering instruction-tuned and reasoning-trained models from two different model families. We track confidence over Chain-of-Thought (CoT), measure the corrective effect of reasoning, and evaluate the contribution of intermediate reasoning steps. We find that models are prone to answer inertia, in which early commitments to a prediction are reinforced rather than revised during reasoning steps. While reasoning-trained models show stronger corrective behavior, their gains depend on modality conditions, from text-dominant to vision-only settings. Using controlled interventions with misleading textual cues, we show that models are consistently influenced by these cues even when visual…
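To make the notions of answer inertia and corrective effect concrete, here is a minimal Python sketch of one way they could be quantified from per-step predictions. It is an illustration under stated assumptions, not the paper's metric definitions: the function names, the input layout (one list of answers per example, one answer per reasoning step), and the two ratios are all hypothetical.

```python
from typing import Sequence


def answer_inertia_rate(step_predictions: Sequence[Sequence[str]]) -> float:
    """Fraction of examples whose final answer equals the first-step answer.

    Hypothetical proxy for "answer inertia"; the paper may define it differently.
    step_predictions[i][t] is the answer the model favors on example i after
    reasoning step t (index 0 is the earliest prediction).
    """
    if not step_predictions:
        raise ValueError("need at least one example")
    kept = sum(1 for steps in step_predictions if steps[0] == steps[-1])
    return kept / len(step_predictions)


def correction_rate(
    step_predictions: Sequence[Sequence[str]], gold: Sequence[str]
) -> float:
    """Among examples that start out wrong, fraction that end up right.

    Hypothetical proxy for the "corrective effect of reasoning".
    Returns 0.0 when no example starts out wrong.
    """
    initially_wrong = [
        (steps, answer)
        for steps, answer in zip(step_predictions, gold)
        if steps[0] != answer
    ]
    if not initially_wrong:
        return 0.0
    fixed = sum(1 for steps, answer in initially_wrong if steps[-1] == answer)
    return fixed / len(initially_wrong)


# Toy data, invented purely for illustration.
preds = [
    ["A", "A", "A"],  # starts right, stays right
    ["B", "C", "C"],  # starts wrong, corrected
    ["A", "B", "A"],  # starts wrong, reverts to the early answer
    ["D", "D", "B"],  # starts wrong, corrected at the last step
]
gold = ["A", "C", "B", "B"]

inertia = answer_inertia_rate(preds)
corrected = correction_rate(preds, gold)
```

On this toy data, half of the examples end on their first-step answer, and two of the three initially wrong examples are corrected. Comparing such ratios across text-dominant and vision-only inputs would be one way to see how the gains vary with modality conditions.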
