Make LVLMs Focus: Context-Aware Attention Modulation for Better Multimodal In-Context Learning

Yanshu Li; Jianjiang Yang; Ziteng Yang; Bozheng Li; Ligong Han; Hongyang He; Zhengtao Yao; Yingjie Victor Chen; Songlin Fei; Dongfang Liu; Ruixiang Tang

arXiv:2505.17097·cs.CV·December 11, 2025

TL;DR

This paper introduces CAMA, a training-free attention modulation method that enhances multimodal in-context learning in large vision-language models by dynamically emphasizing important tokens, leading to improved performance across multiple benchmarks.

Contribution

The paper identifies two weaknesses in LVLMs' self-attention that hinder effective in-context learning (ICL) and proposes CAMA, a plug-and-play method that improves ICL by dynamically adjusting attention logits without additional training.

Findings

01. CAMA consistently outperforms vanilla models and baselines across four LVLMs and seven benchmarks.

02. CAMA enhances the effectiveness of prompt engineering methods.

03. CAMA remains robust across different sequence configurations.

Abstract

Multimodal in-context learning (ICL) is becoming a key capability that allows large vision-language models (LVLMs) to adapt to novel tasks without parameter updates, which expands their usefulness in many real-world applications. However, ICL performance remains unstable even when the in-context demonstrations (ICDs) are well matched, showing that LVLMs still struggle to make full use of the provided context. While existing work mainly focuses on prompt engineering or post-hoc logit calibration, we study the attention mechanisms inside LVLMs to address their inherent limitations. We identify two important weaknesses in their self-attention that hinder effective ICL. To address these weaknesses, we propose Context-Aware Modulated Attention (CAMA), a training-free and plug-and-play method that dynamically adjusts attention logits based on the input in-context sequence. CAMA uses a…
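
To make the idea of adjusting attention logits concrete, here is a minimal sketch of generic pre-softmax logit modulation. It is not CAMA's actual rule: the abstract is truncated before it describes how tokens are selected or how strongly they are adjusted, so this example assumes a fixed additive bias on hand-picked key positions. The names `modulated_attention`, `boost_positions`, and `strength` are hypothetical and chosen only for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def modulated_attention(logits, boost_positions, strength=1.0):
    """Add `strength` to the logits of selected key positions, then normalize.

    logits: list of rows, one row of raw attention scores per query token.
    boost_positions: key indices to emphasize (hypothetical selection; the
        source does not show how CAMA picks them).
    strength: additive bias applied before the softmax (assumed constant).
    """
    boosted = set(boost_positions)
    weights = []
    for row in logits:
        adjusted = [x + strength if j in boosted else x for j, x in enumerate(row)]
        weights.append(softmax(adjusted))
    return weights


# One query attending to four keys with identical raw scores.
logits = [[0.0, 0.0, 0.0, 0.0]]
plain = modulated_attention(logits, boost_positions=[], strength=0.0)
focused = modulated_attention(logits, boost_positions=[2], strength=1.0)

print(plain[0])    # uniform attention over the four keys
print(focused[0])  # key 2 receives a larger share; weights still sum to 1
```

Because the bias is applied before the softmax, no model weights change, which is what makes this family of interventions training-free: the attention distribution is renormalized at inference time and the emphasized positions gain weight at the expense of the others.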
