Seeing Right but Saying Wrong: Inter- and Intra-Layer Refinement in MLLMs without Training
Shezheng Song, Shasha Li, Jie Yu

TL;DR
This paper introduces DualPD, a training-free decoding refinement method for multimodal large language models that improves accuracy by aligning internal reasoning with final outputs through attention-guided contrastive logits and attention head filtering.
Contribution
The paper proposes DualPD, a training-free approach that refines model predictions by analyzing layer-wise attention shifts and filtering out irrelevant attention heads.
Findings
DualPD consistently improves accuracy across multiple benchmarks.
It is effective without additional training or fine-tuning.
It applies to various multimodal models, such as LLaVA and Qwen-VL.
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities across a variety of vision-language tasks. However, their internal reasoning often exhibits a critical inconsistency: although deeper layers may attend to the correct visual regions, final predictions are frequently misled by noisy attention from earlier layers. This results in a disconnect between what the model internally understands and what it ultimately expresses, a phenomenon we describe as "seeing it right but saying it wrong". To address this issue, we propose DualPD, a dual-perspective decoding refinement strategy that enhances visual understanding without any additional training. DualPD consists of two components. (1) The layer-wise attention-guided contrastive logits module captures how the belief in the correct answer evolves by comparing output logits between layers that exhibit the largest…
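The abstract is truncated before it says how the two layers are chosen, so the following is only a generic sketch of contrasting logits between an earlier and a later layer, not the authors' implementation. The function names (`log_softmax`, `contrastive_logits`, `argmax`), the weight `alpha`, and the toy numbers are all hypothetical, and the choice of layers is left to the caller.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def contrastive_logits(early_logits, late_logits, alpha=1.0):
    """Boost tokens whose log-probability grew from the early to the late layer.

    Assumption: a simple additive contrast, late + alpha * (late - early).
    The paper's actual formula is not given in the text above.
    """
    early = log_softmax(early_logits)
    late = log_softmax(late_logits)
    return [l + alpha * (l - e) for l, e in zip(late, early)]

def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)

# Toy vocabulary of three tokens (made-up numbers for illustration only).
early = [2.0, 1.0, 0.0]  # earlier layer clearly prefers token 0
late = [1.5, 1.4, 0.0]   # later layer still narrowly prefers token 0
refined = contrastive_logits(early, late, alpha=1.0)

print(argmax(late), argmax(refined))
```

In this toy case the later layer alone would still emit token 0, but token 1 is the one whose belief increased between the two layers, so the contrasted scores select token 1 instead.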
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
