TL;DR
This paper introduces CAMA, a training-free attention modulation method that enhances multimodal in-context learning in large vision-language models by dynamically emphasizing important tokens, improving performance over vanilla models and baselines across four LVLMs and seven benchmarks.
Contribution
The paper identifies two weaknesses in LVLMs' self-attention that hinder effective ICL and proposes CAMA, a plug-and-play approach that improves ICL by dynamically adjusting attention logits without additional training.
Findings
CAMA consistently outperforms vanilla models and baselines across four LVLMs and seven benchmarks.
CAMA enhances the effectiveness of prompt engineering methods.
CAMA remains robust across different sequence configurations.
Abstract
Multimodal in-context learning (ICL) is becoming a key capability that allows large vision-language models (LVLMs) to adapt to novel tasks without parameter updates, which expands their usefulness in many real-world applications. However, ICL performance remains unstable even when the in-context demonstrations (ICDs) are well matched, showing that LVLMs still struggle to make full use of the provided context. While existing work mainly focuses on prompt engineering or post-hoc logit calibration, we study the attention mechanisms inside LVLMs to address their inherent limitations. We identify two important weaknesses in their self-attention that hinder effective ICL. To address these weaknesses, we propose Context-Aware Modulated Attention (CAMA), a training-free and plug-and-play method that dynamically adjusts attention logits based on the input in-context sequence. CAMA uses a…
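The abstract says CAMA adjusts attention logits based on the input sequence, but the text here is cut off before it gives the actual rule. The sketch below is therefore only a generic illustration of training-free attention-logit modulation, not the authors' method. The function `modulated_attention`, the `boost_mask` argument (which key tokens to emphasize), and the strength `alpha` are all hypothetical names introduced for this example, and causal masking is left out for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def modulated_attention(q, k, v, boost_mask, alpha=1.0):
    """Scaled dot-product attention with an additive bias on selected keys.

    q: (n_q, d) queries, k: (n_k, d) keys, v: (n_k, d_v) values.
    boost_mask: (n_k,) boolean array marking keys to emphasize
                (hypothetical; how to choose them is not specified here).
    alpha: bias added to the logits of the marked keys (hypothetical).
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                      # (n_q, n_k)
    logits = logits + alpha * boost_mask.astype(logits.dtype)
    weights = softmax(logits, axis=-1)                 # rows sum to 1
    return weights @ v, weights


# Usage: compare vanilla attention (alpha=0) with a boosted variant.
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))
k = rng.normal(size=(5, 4))
v = rng.normal(size=(5, 3))
mask = np.array([False, True, False, True, False])

_, w_vanilla = modulated_attention(q, k, v, mask, alpha=0.0)
_, w_boosted = modulated_attention(q, k, v, mask, alpha=2.0)
print(w_vanilla[:, mask].sum(axis=-1))  # attention mass on marked keys
print(w_boosted[:, mask].sum(axis=-1))  # larger after the bias is added
```

Because the bias is applied to the logits at inference time, no parameters change, which is the sense in which an approach of this kind is training-free and plug-and-play.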