Seeing Right but Saying Wrong: Inter- and Intra-Layer Refinement in MLLMs without Training

Shezheng Song; Shasha Li; Jie Yu

arXiv:2601.07359 · cs.CV · January 13, 2026

Open Access

TL;DR

This paper introduces DualPD, a training-free decoding refinement method for multimodal large language models that improves accuracy by aligning internal reasoning with final outputs through attention-guided contrastive logits and attention head filtering.

Contribution

The paper proposes DualPD, a training-free approach that refines model predictions by analyzing layer-wise attention shifts and filtering out irrelevant attention heads.

Findings

01 Consistently improves accuracy across multiple benchmarks.

02 Effective without additional training or fine-tuning.

03 Applicable to various multimodal models like LLaVA and Qwen-VL.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities across a variety of vision-language tasks. However, their internal reasoning often exhibits a critical inconsistency: although deeper layers may attend to the correct visual regions, final predictions are frequently misled by noisy attention from earlier layers. This results in a disconnect between what the model internally understands and what it ultimately expresses, a phenomenon we describe as "seeing it right but saying it wrong". To address this issue, we propose DualPD, a dual-perspective decoding refinement strategy that enhances visual understanding without any additional training. DualPD consists of two components. (1) The layer-wise attention-guided contrastive logits module captures how the belief in the correct answer evolves by comparing output logits between layers that exhibit the largest…
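The general idea of contrasting logits from two layers can be illustrated with a small sketch. This is not the authors' implementation: the abstract is truncated before it states how the two layers are selected, so the sketch takes the two sets of layer logits as given inputs. The function names, the `alpha` weight, and the rule of adding the log-probability difference to the final logits are all assumptions made for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def contrastive_refine(final_logits, early_logits, late_logits, alpha=1.0):
    """Add the change in log-probability between two layers to the final logits.

    early_logits / late_logits: vocabulary logits read out from two
    intermediate layers (how those layers are chosen is NOT modeled here).
    alpha: hypothetical weight on the contrastive term (an assumption).
    """
    early = log_softmax(early_logits)
    late = log_softmax(late_logits)
    return [f + alpha * (l - e) for f, l, e in zip(final_logits, late, early)]

# Toy 3-token vocabulary: the final logits narrowly prefer token 0,
# but belief in token 1 grows between the two intermediate layers.
final = [2.0, 1.9, 0.1]
early = [1.0, 0.0, 0.0]
late = [0.0, 1.0, 0.0]

refined = contrastive_refine(final, early, late)
best = max(range(len(refined)), key=refined.__getitem__)
print(best)
```

In this toy case the unrefined argmax is token 0, while the refined argmax is token 1, because the token whose log-probability rises between the two layers is boosted. In a real MLLM the per-layer logits would come from projecting intermediate hidden states through the output head, which this sketch leaves out.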

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling