Don't Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models
Sangmin Woo, Donguk Kim, Jaehyuk Jang, Yubin Choi, Changick Kim

TL;DR
This paper introduces AvisC, a test-time method that reduces hallucinations in large vision-language models by recalibrating the influence of over-attended, query-irrelevant image tokens (blind tokens), without modifying the model's attention mechanism.
Contribution
Proposes AvisC, a test-time calibration method that dynamically recalibrates the influence of blind tokens to mitigate hallucinations in LVLMs, with no additional training and no change to the model's core structure.
Findings
Significantly reduces hallucinations on benchmark datasets.
Improves alignment between attention focus and relevant image regions.
Enhances the reliability of LVLMs in visual understanding tasks.
Abstract
Large Vision Language Models (LVLMs) demonstrate strong capabilities in visual understanding and description, yet often suffer from hallucinations, attributing incorrect or misleading features to images. We observe that LVLMs disproportionately focus on a small subset of image tokens--termed blind tokens--which are typically irrelevant to the query (e.g., background or non-object regions). We hypothesize that such attention misalignment plays a key role in generating hallucinated responses. To mitigate this issue, we propose Attentional Vision Calibration (AvisC), a test-time approach that dynamically recalibrates the influence of blind tokens without modifying the underlying attention mechanism. AvisC first identifies blind tokens by analyzing layer-wise attention distributions over image tokens, then employs a contrastive decoding strategy to balance the influence of original and…
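The two steps named in the abstract (flag over-attended image tokens, then contrast two sets of logits) can be illustrated with a small, self-contained sketch. It is not the authors' implementation. The thresholding rule (mean plus lam times the standard deviation), the combination (1 + alpha) * original - alpha * biased, the toy numbers, and every function and parameter name are assumptions made for illustration, since this summary does not give the exact formulas.

```python
import math
from statistics import mean, pstdev


def find_blind_tokens(attn, lam=1.0):
    """Return indices of image tokens with unusually high attention.

    attn: one attention weight per image token (hypothetical input, e.g.
    averaged over heads in a selected layer).
    Assumed rule: a token is "blind" if its weight exceeds mean + lam * std.
    """
    threshold = mean(attn) + lam * pstdev(attn)
    return [i for i, a in enumerate(attn) if a > threshold]


def contrastive_logits(logits_orig, logits_biased, alpha=1.0):
    """Assumed contrastive form: amplify the original logits and subtract
    the logits obtained from the blind-token-biased input."""
    return [(1 + alpha) * o - alpha * b
            for o, b in zip(logits_orig, logits_biased)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy attention over six image tokens: token 2 dominates.
attn = [0.02, 0.03, 0.60, 0.02, 0.03, 0.30]
blind = find_blind_tokens(attn, lam=1.0)

# Toy next-token logits over a three-word vocabulary.
# logits_biased stands in for a second forward pass that keeps only the
# blind tokens; here it favors word 0 even more strongly.
logits_orig = [2.0, 1.8, 0.1]
logits_biased = [2.5, 0.5, 0.1]
calibrated = contrastive_logits(logits_orig, logits_biased, alpha=1.0)
probs = softmax(calibrated)

print("blind tokens:", blind)
print("choice before:", argmax(logits_orig), "after:", argmax(calibrated))
```

In this toy setup the word that the biased pass prefers is demoted after calibration, which is the intended effect of contrasting the two passes. The second pass is also why a reviewer below notes that the logits are computed twice per decoding step.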
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
Strengths: 1. The paper is well-written and organized. 2. The proposed AVISC method is straightforward to implement and requires no additional training or external models.
1. While the paper focuses on LLaVA-7B and InstructBLIP, the evaluation could have included more competitive models such as MiniCPM-V, LLaVA-13B, LLaVA-34B, and various grounding models like GLAMM. 2. The method requires computing logits twice - once with the original input and once with the biased input - essentially doubling the inference time. This computational overhead could be problematic in real-world applications where time efficiency is crucial. The paper would benefit from a more deta
(1) AVISC operates directly during the decoding phase, requiring no additional training, auxiliary models, or complex self-feedback mechanisms, making it lightweight and easy to integrate into existing LVLMs. (2) The authors’ discovery of the blind tokens phenomenon is very insightful. They also demonstrated this phenomenon through experiments and by visualizing bounding boxes and visual tokens.
(1) According to the experimental results, dynamically reducing attention dependence on irrelevant “blind tokens” is effective in reducing hallucinations. However, I am still concerned that this contrastive decoding approach may harm the original conversational abilities of LVLMs. The authors have only evaluated AVISC on hallucination benchmarks and have not tested it on more general benchmarks. (2) Do other models, such as Qwen, also exhibit the same phenomenon, or more finely aligned models l
1. The proposed AVISC method is easy to follow and clearly improves performance.
1. The writing of the article needs to be improved. 2. The proposed approach is not sufficiently motivated and lacks a necessary baseline.
- The paper introduces a new method, AVISC, that helps reduce hallucinations in LVLMs by adjusting attention to focus on important visual tokens. This approach is simple yet effective. - AVISC is a lightweight, post-hoc method that works without needing extra training or additional models, making it easy to apply across different LVLMs. - The method shows consistent improvement in performance on standard hallucination benchmarks
- The authors' analysis of attention raises some concerns for me, as it seems somewhat coarse-grained, and the way image attention functions across layers appears to vary. Additionally, the occurrence of blind tokens might be better explained by attention sinks [1], which may not significantly affect the model's overall performance. - I am confused by the POPE results in Tables 1 and 9, as they do not seem to fully align with the results from LLaVA. - The authors did not perf
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Automated Systems · Advanced Image and Video Retrieval Techniques
