TL;DR
This paper introduces GIFT, a method that uses gaze shift tracking to improve cross-modal attention in vision-language models, significantly reducing hallucinations and enhancing task accuracy.
Contribution
GIFT leverages visual attention gaze shifts to enhance cross-modal fusion, effectively mitigating hallucinations in vision-language models without high computational costs.
Findings
Up to 20.7% improvement in hallucination mitigation
Effective across generative and classification tasks
Maintains performance with low computational overhead
Abstract
Vision-language models (VLMs) often generate hallucinations, i.e., content that cannot be substantiated by either textual or visual inputs. Prior work primarily attributes this to over-reliance on linguistic prior knowledge rather than visual inputs. Some methods attempt to mitigate hallucination by amplifying attention to visual tokens in proportion to their attention scores. However, these methods overlook the visual attention sink problem, where attention is frequently misallocated to task-irrelevant visual regions, and neglect cross-modal fusion balance by enhancing only visual attention without adjusting attention to the user query. This can result in amplifying incorrect areas while failing to properly interpret the user query. To address these challenges, we propose a simple yet effective method called Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT). GIFT pre-computes a…
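The failure mode the abstract attributes to score-proportional amplification can be illustrated with a toy attention row. The sketch below is not GIFT and is not taken from any baseline's code: the scaling rule `w * (1 + alpha * w)`, the function name `amplify_visual_attention`, and all numbers are assumptions chosen only to show the effect. It assumes one post-softmax attention row in which a task-irrelevant "sink" patch already holds the largest weight.

```python
def amplify_visual_attention(weights, visual_idx, alpha):
    """Hypothetical baseline: boost each visual token in proportion to its
    own attention weight, then renormalize the row so it sums to 1."""
    visual = set(visual_idx)
    boosted = [
        w * (1.0 + alpha * w) if i in visual else w
        for i, w in enumerate(weights)
    ]
    total = sum(boosted)
    return [b / total for b in boosted]


# Toy attention row (assumed values): indices 0-2 are visual tokens,
# indices 3-4 are query-text tokens. Index 0 plays the role of a
# task-irrelevant sink patch; index 1 is the patch relevant to the query.
weights = [0.50, 0.10, 0.05, 0.20, 0.15]
visual_idx = [0, 1, 2]
text_idx = [3, 4]

amplified = amplify_visual_attention(weights, visual_idx, alpha=2.0)

# Change in attention for the sink patch and for the relevant patch.
sink_gain = amplified[0] - weights[0]
relevant_gain = amplified[1] - weights[1]

# Total attention left on the user query before and after amplification.
text_mass_before = sum(weights[i] for i in text_idx)
text_mass_after = sum(amplified[i] for i in text_idx)

print("sink gain:", round(sink_gain, 4))
print("relevant-patch gain:", round(relevant_gain, 4))
print("query attention before/after:",
      round(text_mass_before, 4), round(text_mass_after, 4))
```

Under these assumed numbers, the sink patch absorbs most of the added attention, the relevant patch gains less than the sink, and the share of attention on the query tokens shrinks after renormalization. These are the two problems the abstract names: amplifying incorrect areas and unbalancing cross-modal fusion.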
Peer Reviews
Decision: Submitted to ICLR 2026
1. The idea of using gaze shifts to dynamically adjust visual attention in VLMs is a novel and promising approach. It effectively addresses key challenges in cross-modal fusion and visual attention misallocation (visual attention sink), which are critical issues in VLM performance.
2. The paper provides extensive experiments that show GIFT achieves up to 20.7% improvement in hallucination mitigation, outperforming existing methods across several vision-language datasets and models of varying ar
1. Some formulas are missing concluding punctuation (e.g., periods at the end of equations). Sections 5 and 6 could be merged. Both sections discuss experimental results and analyses, and their separation feels redundant. Combining them into a single cohesive section would improve the flow and clarity of the paper.
2. The experiments in the paper are mainly focused on the LLaVA model, which limits the generalizability of the results. Although the authors show promising results for LLaVA, there
1. Clear and intuitive idea: The gaze shift concept is easy to understand and well-motivated by human visual attention behavior.
2. Addresses multiple issues simultaneously: Tackles visual attention sink, low visual contribution, and imbalanced cross-modal fusion, which existing methods often address in isolation.
3. Low computational overhead: Achieves improvements with modest runtime increase compared to greedy decoding.
4. Comprehensive evaluation: Experiments span multiple benchmarks, mod
1. Experimental setting is somewhat outdated: The chosen base VLMs (LLaVA-1.5 series and Qwen2-VL) were released over a year ago. More recent models, such as LLaVA-Next and InternVL, implement Dynamic High Resolution image processing, which could impact saliency computation. Testing the method on these architectures would strengthen claims about generality.
2. Limited hallucination benchmarks: Evaluation could include newer datasets such as HallusionBench or other recent challenging hallucination ta
1. GIFT introduces a human-inspired "gaze shift" tracking approach that addresses a critical gap in existing work: static attention averaging (used by baselines like VAF) often misallocates attention to irrelevant regions.
2. It integrates into existing VLMs without retraining, unlike training-based methods that incur high computational costs.
3. It consistently improves performance across diverse VLMs (LLaVA-1.5 7B/13B, Qwen2-VL 7B) and tasks (object detection, captioning, VQA), demonstrating it
1. GIFT heavily relies on "information-rich query tokens" (identified via POS tagging) to compute accurate saliency maps. The authors acknowledge that vague, ambiguous, or visually irrelevant queries (e.g., "Describe this image" without specific cues) may lead to inaccurate maps and reduced hallucination mitigation. However, they do not provide concrete strategies to handle such cases (e.g., no analysis of performance on low-specificity queries or a fallback mechanism for query-scarce scenarios).
