What do vision-language models see in the context? Investigating multimodal in-context learning
Gabriel O. dos Santos, Esther Colombini, Sandra Avila

TL;DR
This paper systematically investigates in-context learning in vision-language models, revealing limitations in multimodal integration and highlighting the effects of training strategies and attention patterns on model performance.
Contribution
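The paper presents a systematic analysis of ICL in VLMs, examining how architecture, training strategy, and attention patterns affect multimodal in-context learning. To the authors' knowledge, it is the first to analyze how attention patterns in VLMs vary as the number of in-context demonstrations increases.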
Findings
Training on image-text interleaved data improves ICL performance.
Instruction tuning enhances instruction-following but reduces reliance on demonstrations.
Current VLMs mainly focus on textual cues, limiting multimodal integration.
Abstract
In-context learning (ICL) enables Large Language Models (LLMs) to learn tasks from demonstration examples without parameter updates. Although it has been extensively studied in LLMs, its effectiveness in Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic study of ICL in VLMs, evaluating seven models spanning four architectures on three image captioning benchmarks. We analyze how prompt design, architectural choices, and training strategies influence multimodal ICL. To our knowledge, we are the first to analyze how attention patterns in VLMs vary with an increasing number of in-context demonstrations. Our results reveal that training on image-text interleaved data enhances ICL performance but does not imply effective integration of visual and textual information from demonstration examples. In contrast, instruction tuning improves…
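To make the setup concrete, the following is a minimal sketch of how an n-shot, image-text interleaved captioning prompt might be assembled. It is not the authors' code: the `Demo` class, the `build_icl_prompt` function, the `<image>` placeholder token, and the "Caption:" instruction string are all hypothetical names chosen for illustration, and real VLMs each define their own image tokens and prompt templates.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Demo:
    """One in-context demonstration: an image and its reference caption."""
    image_path: str
    caption: str


# Hypothetical placeholder marking where an image is spliced into the text.
# Actual models use their own special tokens; this one is an assumption.
IMAGE_TOKEN = "<image>"


def build_icl_prompt(
    demos: List[Demo],
    query_image: str,
    n_shots: int,
    instruction: str = "Caption:",
) -> Tuple[str, List[str]]:
    """Interleave n_shots (image, caption) pairs, then append the query image.

    Returns the text prompt and the image paths in the order in which the
    placeholder tokens appear in that prompt.
    """
    if n_shots < 0:
        raise ValueError("n_shots must be non-negative")
    shots = demos[:n_shots]
    # Each demonstration contributes one image slot followed by its caption.
    segments = [f"{IMAGE_TOKEN} {instruction} {d.caption}" for d in shots]
    # The query has an image slot but no caption: the model completes it.
    segments.append(f"{IMAGE_TOKEN} {instruction}")
    images = [d.image_path for d in shots] + [query_image]
    return "\n".join(segments), images


demos = [Demo("dog.jpg", "A dog runs on a beach."),
         Demo("cat.jpg", "A cat sleeps on a sofa.")]
prompt, images = build_icl_prompt(demos, "query.jpg", n_shots=2)
print(prompt)
print(images)
```

Varying `n_shots` from zero upward gives the kind of increasing-demonstration sweep the abstract describes; with `n_shots=0` the prompt reduces to the query image and the instruction alone.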
