Towards Multimodal In-Context Learning for Vision & Language Models
Sivan Doveh, Shaked Perek, M. Jehanzeb Mirza, Wei Lin, Amit Alfassy, Assaf Arbelle, Shimon Ullman, Leonid Karlinsky

TL;DR
This paper investigates the limited in-context learning capabilities of current vision-language models and proposes a curriculum-based training method to significantly improve their performance on ICL tasks, supported by new benchmarks.
Contribution
It introduces a simple curriculum-based training approach that enhances ICL abilities in vision-language models and provides new benchmarks for evaluating ICL performance.
Findings
21.03% ICL performance boost over baselines
Effective data mixes improve ICL abilities
New benchmarks for ICL evaluation in VLMs
Abstract
State-of-the-art Vision-Language Models (VLMs) ground the vision and language modalities primarily by projecting the vision tokens from the encoder to language-like tokens, which are fed directly to the Large Language Model (LLM) decoder. While these models have shown unprecedented performance in many downstream zero-shot tasks (e.g., image captioning, question answering, etc.), little emphasis has been put on transferring one of the core LLM capabilities: In-Context Learning (ICL). ICL is the ability of a model to reason about a downstream task from a few example demonstrations embedded in the prompt. In this work, through extensive evaluations, we find that state-of-the-art VLMs somewhat lack the ability to follow ICL instructions. In particular, we discover that even models that underwent large-scale mixed modality pre-training and were implicitly guided to make use of…
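To make the notion of "a few demonstrations embedded in the prompt" concrete, here is a minimal Python sketch of how a multimodal few-shot prompt might be assembled. It is an illustration only and not the paper's method or data format: the `Demo` class, the `<image>` placeholder string, the `Question:`/`Answer:` template, and the file names are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed placeholder marking where a model would splice in projected
# vision tokens; real VLMs each define their own image token.
IMAGE_TOKEN = "<image>"


@dataclass
class Demo:
    """One in-context demonstration (hypothetical structure)."""
    image_ref: str  # identifier or path of the demonstration image
    question: str
    answer: str


def build_icl_prompt(
    demos: List[Demo], query_image: str, query_question: str
) -> Tuple[str, List[str]]:
    """Interleave solved demonstrations with an unanswered query.

    Returns the prompt text and the image references in the order
    their placeholders appear in the text.
    """
    parts: List[str] = []
    images: List[str] = []
    for demo in demos:
        # Each demonstration is a fully solved (image, question, answer) triple.
        parts.append(
            f"{IMAGE_TOKEN}\nQuestion: {demo.question}\nAnswer: {demo.answer}"
        )
        images.append(demo.image_ref)
    # The query repeats the pattern but leaves the answer open for the model.
    parts.append(f"{IMAGE_TOKEN}\nQuestion: {query_question}\nAnswer:")
    images.append(query_image)
    return "\n\n".join(parts), images


if __name__ == "__main__":
    shots = [
        Demo("cat.jpg", "What animal is shown?", "cat"),
        Demo("dog.jpg", "What animal is shown?", "dog"),
    ]
    prompt, image_refs = build_icl_prompt(shots, "bird.jpg", "What animal is shown?")
    print(prompt)
    print(image_refs)
```

In this sketch the model would have to infer the task (here, naming the animal) from the two solved examples rather than from an explicit instruction, which is the behavior the abstract reports current VLMs handle poorly.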
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
