Attention Guided Alignment in Efficient Vision-Language Models
Shweta Mahajan, Hoang Le, Hyojin Park, Farzad Farhadzadeh, Munawar Hayat, Fatih Porikli

TL;DR
This paper analyzes attention patterns in efficient vision-language models, identifies issues causing object hallucination, and introduces AGE-VLM, a new framework that improves visual grounding by integrating spatial knowledge from SAM, leading to better alignment and reduced hallucination.
Contribution
We propose AGE-VLM, a novel framework that enhances visual grounding in efficient VLMs by using interleaved cross-attention layers and spatial knowledge from SAM, addressing hallucination issues.
Findings
Our analysis reveals concatenation-based architectures often fail to distinguish matching image-text pairs.
AGE-VLM significantly reduces object hallucination in vision-language models.
The proposed method achieves better or comparable results on vision-centric benchmarks.
Abstract
Large Vision-Language Models (VLMs) rely on effective multimodal alignment between pre-trained vision encoders and Large Language Models (LLMs) to integrate visual and textual information. This paper presents a comprehensive analysis of attention patterns in efficient VLMs, revealing that concatenation-based architectures frequently fail to distinguish between semantically matching and non-matching image-text pairs. This is a key factor in object hallucination in these models. To address this, we introduce Attention-Guided Efficient Vision-Language Models (AGE-VLM), a novel framework that enhances visual grounding through interleaved cross-attention layers to instill vision capabilities in pretrained small language models. This gives the VLM the ability to "look" at the correct image regions by leveraging spatial knowledge distilled from the Segment Anything Model (SAM), significantly…
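To make the cross-attention idea in the abstract concrete, the following is a minimal single-head sketch in plain Python, in which each text-token query attends over image-patch keys and values. It is an illustration only, not the AGE-VLM implementation: the function names, the absence of learned projection matrices, and the toy dimensions are all assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_queries, image_keys, image_values):
    """Single-head cross-attention (hypothetical helper, no learned projections).

    Each text query attends over all image patches. Returns the attended
    outputs and the per-query attention weights over patches.
    """
    dim = len(image_keys[0])
    outputs, weights = [], []
    for query in text_queries:
        # Scaled dot-product score between this text token and every patch.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in image_keys
        ]
        w = softmax(scores)
        # Weighted sum of patch values.
        out = [
            sum(w[j] * image_values[j][c] for j in range(len(image_values)))
            for c in range(len(image_values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy example: one text token, three image patches.
outputs, weights = cross_attention(
    text_queries=[[4.0, 0.0]],
    image_keys=[[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    image_values=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
)
```

In this form the attention weights are an explicit map over image patches for each text token, which is what makes it possible to supervise them with a spatial signal such as a segmentation mask.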
Peer Reviews
Decision · Submitted to ICLR 2026
- The similarity-distribution study nicely evidences why concatenation architectures blur matching vs. non-matching pairs; it is a useful and reproducible diagnostic
- Distilling SAM masks into cross-attention (not the vision backbone) is conceptually clean and data-efficient; the dice loss formulation is appropriate for sparse regions
- Interleaving cross-attention in a 1B LLM while mostly freezing self-attention preserves language priors and keeps training economical; the staged plan is easy to adopt
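The dice loss mentioned in this review can be sketched as a soft Dice overlap between a flattened attention map and a binary mask. This is a generic formulation, not necessarily the exact loss used in the paper; the function name, the smoothing constant `eps`, and the assumption that attention values are already scaled to [0, 1] are all illustrative choices.

```python
def dice_loss(attention, mask, eps=1e-6):
    """Soft Dice loss between an attention map and a binary mask.

    Both inputs are flattened lists of equal length. `attention` is assumed
    to lie in [0, 1]; `mask` holds 0/1 values (for example from a
    segmentation model). Returns a value near 0 for perfect overlap and
    near 1 for no overlap.
    """
    intersection = sum(a * m for a, m in zip(attention, mask))
    total = sum(attention) + sum(mask)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)

mask = [1, 1, 0, 0]
perfect = dice_loss([1.0, 1.0, 0.0, 0.0], mask)   # attention matches the mask
disjoint = dice_loss([0.0, 0.0, 1.0, 1.0], mask)  # attention misses the mask
partial = dice_loss([1.0, 0.0, 0.0, 0.0], mask)   # attention covers half the mask
```

Because Dice normalises by the combined size of the prediction and the mask, a small masked region is not swamped by the many background patches, which is the usual reason it is preferred over a per-patch loss for sparse regions.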
- Evaluation protocol introduces an external judge: For CV-Bench, accuracy is computed by Qwen-L because models sometimes omit option letters; this can inject evaluator bias and hide raw option-selection accuracy
- Heavy reliance on SAM/Grounded-SAM: Performance hinges on third-party segmentation quality and prompt engineering; generalization when masks are imperfect/noisy is not stress-tested (authors acknowledge broader-impact limits)
- There is a CA-baseline (cross-attention without guidance
1. The paper discusses a relevant problem for efficient vision–language models.
2. The proposed SAM-guided cross-attention design is interpretable and integrates spatial grounding signals into a lightweight architecture without significant computational costs.
1. The SAM-guided cross-attention mechanism is relatively straightforward and appears as a simple extension of prior attention-alignment and grounding approaches.
2. The reported improvements do not look significant, and the method does not outperform the baseline by a considerable margin.
3. The role of each training stage and the specific contribution of the SAM supervision are not clearly disentangled, making it difficult to assess which component drives the observed gains.
1. The paper tackles the important problem of object hallucination in multimodal models and proposes a novel idea: guiding attention to relevant areas of the image using text-grounded segmentation masks.
2. The experimental setup is well formulated, including multiple stages of pre-training and instruction fine-tuning, with a segmentation-grounded loss introduced in some stages. Particular attention is paid to maintaining language modeling performance.
3. The results are well presente
1. Section 3.1 computes cosine similarity between final-layer hidden states at image-token vs. text-token positions on matched and mismatched pairs, but it doesn't clearly justify why hidden states are used instead of similarities in Q/K-space (e.g., cos(W_q h_t, W_k h_v)), which drive attention. It is unclear whether earlier/middle layers exhibit different alignment. How multiple tokens per modality are reduced is also not mentioned.
2. Segmentation grounding loss is applied to 10% of the sa
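The distinction raised in point 1, between similarity of raw hidden states and similarity in Q/K-space, can be illustrated with a small sketch. Everything here is hypothetical: the helper names, the use of mean pooling to reduce multiple tokens per modality (the review notes that the paper's reduction is not specified), and the toy projection matrices.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pool(tokens):
    """Average a list of token vectors (an assumed reduction, see lead-in)."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def hidden_state_similarity(text_states, image_states):
    """cos(h_t, h_v) on pooled hidden states."""
    return cosine(mean_pool(text_states), mean_pool(image_states))

def qk_similarity(w_q, w_k, text_states, image_states):
    """cos(W_q h_t, W_k h_v) on pooled hidden states."""
    query = matvec(w_q, mean_pool(text_states))
    key = matvec(w_k, mean_pool(image_states))
    return cosine(query, key)

# Toy hidden states: text and image are orthogonal in hidden space.
text_states = [[1.0, 0.0], [1.0, 0.0]]
image_states = [[0.0, 1.0]]
w_q = [[1.0, 0.0], [0.0, 1.0]]  # identity projection
w_k = [[0.0, 1.0], [1.0, 0.0]]  # swaps the two coordinates

hidden_sim = hidden_state_similarity(text_states, image_states)
qk_sim = qk_similarity(w_q, w_k, text_states, image_states)
```

In this toy case the pooled hidden states are orthogonal while the projected query and key are perfectly aligned, so the two measures can disagree completely; this is why the choice of space matters for the diagnostic.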
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
