To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs
Rui Hong, Shuxue Quan

TL;DR
This paper introduces a diagnostic framework for vision-language models (VLMs) that separates reliance on visual information from reliance on language shortcuts, revealing widespread visual sycophancy and scale-dependent effects.
Contribution
The Tri-Layer Diagnostic Framework disentangles hallucination sources in VLMs and uncovers systematic visual sycophancy and scale effects without additional training.
Findings
69.6% of samples show visual sycophancy: models detect visual anomalies but still hallucinate to satisfy user expectations.
Zero samples exhibit robust refusal, indicating that alignment training has suppressed truthful acknowledgment of uncertainty.
Larger models reduce language shortcuts but increase visual sycophancy.
Abstract
When VLMs answer correctly, do they genuinely rely on visual information or exploit language shortcuts? We introduce the Tri-Layer Diagnostic Framework, which disentangles hallucination sources via three metrics: Latent Anomaly Detection (perceptual awareness), Visual Necessity Score (visual dependency, measured via KL divergence), and Competition Score (conflict between visual grounding and instruction following). Using counterfactual interventions (blind, noise, and conflict images) across 7 VLMs and 7,000 model-sample pairs, our taxonomy reveals that 69.6% of samples exhibit Visual Sycophancy--models detect visual anomalies but hallucinate to satisfy user expectations--while zero samples show Robust Refusal, indicating alignment training has systematically suppressed truthful uncertainty acknowledgment. A scaling analysis (Qwen2.5-VL 7B to 72B) shows larger models reduce Language…
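The abstract says the Visual Necessity Score measures visual dependency via KL divergence, but it does not give the exact formula. The sketch below is one plausible reading, not the authors' definition. It assumes the score compares the model's answer distribution with the real image against its distribution under the blind intervention. The function names, the direction of the divergence, and the use of raw answer logits are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def visual_necessity(logits_with_image, logits_blind):
    """Hypothetical score: how far the answer distribution moves
    when the image is removed (assumed direction: image || blind)."""
    p = softmax(logits_with_image)
    q = softmax(logits_blind)
    return kl_divergence(p, q)


if __name__ == "__main__":
    # Illustrative logits over three candidate answers (made-up values).
    with_image = [5.0, 0.0, 0.0]
    blind = [0.0, 0.0, 0.0]
    print(visual_necessity(with_image, blind))
```

Under this reading, a score near zero means the model answers the same way with or without the image, which suggests a language shortcut. A large score means the image changed the answer distribution substantially.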
