Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation

Zheng Qi; Chao Shang; Evangelia Spiliopoulou; Nikolaos Pappas

arXiv:2510.22067·cs.CV·November 11, 2025

TL;DR

This paper introduces GIFT, a method that uses gaze shift tracking to improve cross-modal attention in vision-language models, significantly reducing hallucinations and enhancing task accuracy.

Contribution

GIFT leverages visual attention gaze shifts to enhance cross-modal fusion, effectively mitigating hallucinations in vision-language models without high computational costs.

Findings

1. Up to 20.7% improvement in hallucination mitigation
2. Effective across generative and classification tasks
3. Maintains performance with low computational overhead

Abstract

Vision language models (VLMs) often generate hallucinations, i.e., content that cannot be substantiated by either textual or visual inputs. Prior work primarily attributes this to over-reliance on linguistic prior knowledge rather than visual inputs. Some methods attempt to mitigate hallucination by amplifying visual token attention proportionally to their attention scores. However, these methods overlook the visual attention sink problem, where attention is frequently misallocated to task-irrelevant visual regions, and neglect cross-modal fusion balance by enhancing only visual attention without adjusting attention to the user query. This can result in amplifying incorrect areas while failing to properly interpret the user query. To address these challenges, we propose a simple yet effective method called Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT). GIFT pre-computes a…
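
The two criticisms of proportional amplification in the abstract can be seen in a small numeric sketch. This is a toy illustration, not code from the paper: the function name `amplify_visual`, the scaling factor `alpha`, and the attention values are all assumptions chosen for the example. It boosts each visual token's weight in proportion to its existing score and renormalizes the row.

```python
def amplify_visual(attn, visual_idx, alpha):
    """Boost each visual token's weight in proportion to its own score, then renormalize."""
    boosted = [a * (1.0 + alpha) if i in visual_idx else a
               for i, a in enumerate(attn)]
    total = sum(boosted)
    return [b / total for b in boosted]


# Toy attention row (made-up numbers): three visual tokens, then two query tokens.
# Index 0 plays the role of a task-irrelevant "sink" patch.
attn = [0.50, 0.05, 0.05, 0.20, 0.20]
visual_idx = {0, 1, 2}
query_idx = {3, 4}

out = amplify_visual(attn, visual_idx, alpha=1.0)
query_before = sum(attn[i] for i in query_idx)
query_after = sum(out[i] for i in query_idx)

print([round(x, 4) for x in out])
print(round(query_before, 4), round(query_after, 4))
```

With these numbers the sink patch grows from 0.50 to 0.625 of the attention mass, while the combined weight on the query tokens falls from 0.40 to 0.25. Under the stated assumptions, this is the pattern the abstract warns about: an already over-attended region is amplified further, and attention to the user query shrinks.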

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 6·Confidence 4

Strengths

1. The idea of using gaze shifts to dynamically adjust visual attention in VLMs is a novel and promising approach. It effectively addresses key challenges in cross-modal fusion and visual attention misallocation (visual attention sink), which are critical issues in VLM performance.
2. The paper provides extensive experiments that show GIFT achieves up to 20.7% improvement in hallucination mitigation, outperforming existing methods across several vision-language datasets and models of varying ar

Weaknesses

1. Some formulas are missing concluding punctuation (e.g., periods at the end of equations). Sections 5 and 6 could be merged. Both sections discuss experimental results and analyses, and their separation feels redundant. Combining them into a single cohesive section would improve the flow and clarity of the paper.
2. The experiments in the paper are mainly focused on the LLaVA model, which limits the generalizability of the results. Although the authors show promising results for LLaVA, there

Reviewer 02·Rating 4·Confidence 4

Strengths

1. Clear and intuitive idea: The gaze shift concept is easy to understand and well-motivated by human visual attention behavior.
2. Addresses multiple issues simultaneously: Tackles visual attention sink, low visual contribution, and imbalanced cross-modal fusion, which existing methods often address in isolation.
3. Low computational overhead: Achieves improvements with modest runtime increase compared to greedy decoding.
4. Comprehensive evaluation: Experiments span multiple benchmarks, mod

Weaknesses

1. Experimental setting is somewhat outdated: The chosen base VLMs (LLaVA-1.5 series and Qwen2-VL) were released over a year ago. More recent models (such as LLaVA-Next and InternVL) implement Dynamic High Resolution image processing, which could impact saliency computation. Testing the method on these architectures would strengthen claims about generality.
2. Limited hallucination benchmarks: Evaluation could include newer datasets such as HallusionBench or other recent challenging hallucination ta

Reviewer 03·Rating 4·Confidence 3

Strengths

1. GIFT introduces a human-inspired "gaze shift" tracking approach that addresses a critical gap in existing work: static attention averaging (used by baselines like VAF) often misallocates attention to irrelevant regions.
2. It integrates into existing VLMs without retraining, unlike training-based methods that incur high computational costs.
3. It consistently improves performance across diverse VLMs (LLaVA-1.5 7B/13B, Qwen2-VL 7B) and tasks (object detection, captioning, VQA), demonstrating it

Weaknesses

1. GIFT heavily relies on "information-rich query tokens" (identified via POS tagging) to compute accurate saliency maps. The authors acknowledge that vague, ambiguous, or visually irrelevant queries (e.g., "Describe this image" without specific cues) may lead to inaccurate maps and reduced hallucination mitigation. However, they do not provide concrete strategies to handle such cases: for example, there is no analysis of performance on low-specificity queries and no fallback mechanism for query-scarce scenarios.
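
The dependence on informative query tokens described in this review can be illustrated with a toy sketch. Everything here is an assumption made for illustration, since the excerpt does not show the paper's formulation. The sketch treats a "gaze shift" as a positive change in visual attention between consecutive query tokens, accumulates it only at tokens flagged as information-rich, and takes those flags as given. The function `gaze_shift_saliency`, the flags, and the attention values are all hypothetical.

```python
def gaze_shift_saliency(attn_per_token, informative):
    """Accumulate positive changes in visual attention across query tokens.

    attn_per_token: one row per query token, each a list of attention
        weights over visual patches (toy values in this example).
    informative: one bool per query token; hypothetically produced by a
        POS tagger upstream, hard-coded here.
    """
    n_patches = len(attn_per_token[0])
    saliency = [0.0] * n_patches
    for t in range(1, len(attn_per_token)):
        if not informative[t]:
            continue
        prev, cur = attn_per_token[t - 1], attn_per_token[t]
        for p in range(n_patches):
            # Keep only increases in attention; decreases are ignored.
            saliency[p] += max(cur[p] - prev[p], 0.0)
    total = sum(saliency)
    return [s / total for s in saliency] if total > 0 else saliency


# Toy query "Is there a dog": patch 0 is an attention sink, patch 2 shows the dog.
rows = [
    [0.6, 0.2, 0.2],  # "Is"
    [0.6, 0.2, 0.2],  # "there"
    [0.6, 0.2, 0.2],  # "a"
    [0.5, 0.1, 0.4],  # "dog"
]
specific = gaze_shift_saliency(rows, [False, False, False, True])
vague = gaze_shift_saliency(rows, [False, False, False, False])
static = [sum(r[p] for r in rows) / len(rows) for p in range(3)]

print(specific, vague, [round(s, 3) for s in static])
```

In this toy setting, static averaging still ranks the sink patch highest, while the shift-based map concentrates on the patch whose attention rose at "dog". When no token is flagged as informative, the map stays all zeros, which is one way to picture the query-scarce failure mode the reviewer raises.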
