TL;DR
This paper introduces VaLSe, a framework that first interprets and then steers the internal representations of vision-language models: it builds visual contribution maps that show which image regions drive each output token, then realigns internal features toward those regions to reduce object hallucinations.
Contribution
The paper presents VaLSe, an interpretability and mitigation framework that uses visual contribution maps and latent space steering to reduce object hallucinations in large vision-language models (LVLMs).
Findings
VaLSe effectively reduces object hallucinations across multiple benchmarks.
Visual contribution maps reveal the model's focus regions and decision-making process.
Existing object hallucination (OH) metrics have limitations, such as mislabeling visually grounded tokens, which highlights the need for better evaluation benchmarks.
Abstract
Large Vision-Language Models (LVLMs) have achieved remarkable success but continue to struggle with object hallucination (OH), generating outputs inconsistent with visual inputs. While previous work has proposed methods to reduce OH, the visual decision-making mechanisms that lead to hallucinations remain poorly understood. In this paper, we propose VaLSe, a Vision-aware Latent Steering framework that adopts an interpretation-then-mitigation strategy to address OH in LVLMs. By tackling dual challenges of modeling complex vision-language interactions and eliminating spurious activation artifacts, VaLSe can generate visual contribution maps that trace how specific visual inputs influence individual output tokens. These maps reveal the model's vision-aware focus regions, which are then used to perform latent space steering, realigning internal representations toward semantically relevant…
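The latent space steering step can be illustrated with a minimal sketch. It is not the authors' implementation: the peer reviews on this page only state that the steering direction comes from PCA on latent differences, so the function names `steering_direction` and `steer`, the uncentered PCA (top right singular vector), the sign convention, and the `strength` parameter are all assumptions made for illustration.

```python
import numpy as np

def steering_direction(h_pos, h_neg):
    """Return a unit steering direction from paired hidden states.

    h_pos, h_neg: arrays of shape (n_samples, dim), hypothetical hidden
    states for the "desired" and "undesired" version of each sample.
    Assumption: the direction is the top right singular vector of the
    per-sample differences (uncentered PCA).
    """
    diffs = h_pos - h_neg
    # Rows of vt are principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    direction = vt[0]
    # Singular vectors have arbitrary sign; orient toward the mean difference.
    if direction @ diffs.mean(axis=0) < 0:
        direction = -direction
    return direction / np.linalg.norm(direction)

def steer(hidden, direction, strength=1.0):
    """Shift every hidden state along the steering direction."""
    return hidden + strength * direction
```

In practice the hidden states would come from chosen layers of the LVLM at inference time; which layers to steer and how strongly is exactly the sensitivity the reviewers ask the authors to analyze.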
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The proposed method does not decrease the general task performance. It is essential to address object hallucination while preserving the general capabilities of LVLMs. 2. VTI is the closest related work to the proposed method, VaLSe. In VTI, they generate the positive and negative samples of images and texts. Then, they compute the steering direction using PCA. This direction is used for test-time inference through intervention. The proposed method advances the generation modules of VTI. The
1. In Section 3.3, the paper gives an analysis to understand the model's behavior under perturbed inputs. However, the derivation is hard to follow. - The notation used lacks clear definitions, making it challenging to understand the mathematical reasoning. - The claim that the attention matrix of ideal noise is zero requires further justification. 2. VaLSe requires a dataset for the learned direction (Eq. 6). However, the optimization details are ambiguous. Specifically, it is unclear how the hy
Strengths 1. The motivation is strong and well explained, addressing an important issue of object hallucination in LVLMs. 2. The proposed method, VaLSe, is conceptually clear and combines interpretability with mitigation in a meaningful way. 3. The approach provides useful visual contribution maps that make the model’s reasoning more transparent.
Weaknesses 1. The reliability of the visual contribution maps is not fully verified, and the connection between visualization and real causal influence may be weak. 2. The method may add extra computation and inference time because of its two-stage process. 3. The effect of latent space steering may depend heavily on which layers or parameters are adjusted, and this sensitivity is not deeply analyzed.
1. The writing is organized and clear, with the core idea of moving from interpretation to mitigation. 2. The paper usefully points out that existing OH metrics can mislabel visually grounded tokens and backs this with qualitative examples.
1. The proposed interpretation–then–steering pipeline is noticeably heavier than decoding-only hallucination controls. To obtain token-wise visual contribution maps, VaLSe has to (i) identify visual-sensitive tokens via an image/noise comparison and (ii) for each selected token, run a Grad-CAM–style gradient-weighted propagation over all attention layers, which already implies at least one backward-style pass on top of the normal generation. On top of that, the steering direction is computed fro
S1. The framework systematically includes visual‑sensitive token selection, contribution map generation with artifact removal, and steering via PCA. S2. Extensive experiments across models and benchmarks demonstrate consistent reductions in hallucination metrics without harming general performance. S3. The paper is generally well‑structured.
W1. VaLSe requires additional forward passes to compute LLR for each token and gradient‑based contribution maps, followed by PCA on latent differences. While the method is training‑free, inference‑time cost may be high for long responses or high‑resolution images. A clearer discussion of computational cost and optimizations would be useful. W2. The LLR threshold alpha and the number of masked regions are tuned manually. Although ablations suggest robustness, automatic selection strategies or ad
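The visual-sensitive token selection that the reviews describe (an image/noise comparison, a per-token LLR, and a threshold alpha) can be sketched as follows. This is a hedged illustration, not the paper's code: the exact LLR definition is an assumption (log-probability of each generated token given the real image minus its log-probability given a noise image), and `visual_sensitive_tokens`, `logits_image`, and `logits_noise` are hypothetical names.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def visual_sensitive_tokens(logits_image, logits_noise, token_ids, alpha):
    """Select positions whose token depends strongly on the image.

    logits_image, logits_noise: (seq_len, vocab) next-token logits from two
    forward passes, one with the real image and one with a noise image.
    token_ids: (seq_len,) ids of the tokens that were actually generated.
    Assumption: LLR = log p(token | image) - log p(token | noise),
    and a position is selected when its LLR exceeds alpha.
    """
    positions = np.arange(len(token_ids))
    lp_image = log_softmax(logits_image)[positions, token_ids]
    lp_noise = log_softmax(logits_noise)[positions, token_ids]
    llr = lp_image - lp_noise
    return np.flatnonzero(llr > alpha), llr
```

Under this reading, each response needs a second forward pass with the noise image, which is consistent with the extra inference cost raised in W1.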
