Towards Multimodal In-Context Learning for Vision & Language Models

Sivan Doveh; Shaked Perek; M. Jehanzeb Mirza; Wei Lin; Amit Alfassy; Assaf Arbelle; Shimon Ullman; Leonid Karlinsky

arXiv:2403.12736·cs.CV·July 18, 2024·1 cites

Open Access

TL;DR

This paper investigates the limited in-context learning capabilities of current vision-language models and proposes a curriculum-based training method to significantly improve their performance on ICL tasks, supported by new benchmarks.

Contribution

It introduces a simple curriculum-based training approach that enhances ICL abilities in vision-language models and provides new benchmarks for evaluating ICL performance.

Findings

01. 21.03% ICL performance boost over baselines
02. Effective data mixes improve ICL abilities
03. New benchmarks for ICL evaluation in VLMs

Abstract

State-of-the-art Vision-Language Models (VLMs) ground the vision and the language modalities primarily by projecting the vision tokens from the encoder to language-like tokens, which are fed directly to the Large Language Model (LLM) decoder. While these models have shown unprecedented performance in many downstream zero-shot tasks (e.g., image captioning, question answering, etc.), little emphasis has been put on transferring one of the core LLM capabilities: In-Context Learning (ICL). ICL is the ability of a model to reason about a downstream task from a few example demonstrations embedded in the prompt. In this work, through extensive evaluations, we find that state-of-the-art VLMs somewhat lack the ability to follow ICL instructions. In particular, we discover that even models that underwent large-scale mixed modality pre-training and were implicitly guided to make use of…
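As an illustration of the ICL setting the abstract describes, the sketch below assembles a few-shot multimodal prompt in which each demonstration pairs an image with a question and its answer, followed by an unanswered query. It is not the paper's method or prompt format: the `<image>` placeholder, the `Shot` class, the `build_icl_prompt` helper, and the file names are all hypothetical, and real VLMs define their own image tokens and templates.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical placeholder marking where an image is inserted in the prompt.
# Actual VLMs use their own model-specific image tokens.
IMAGE_TOKEN = "<image>"


@dataclass
class Shot:
    """One in-context demonstration: an image, a question, and its answer."""
    image_path: str
    question: str
    answer: str


def build_icl_prompt(
    shots: List[Shot], query_image: str, query_question: str
) -> Tuple[str, List[str]]:
    """Return the prompt text and the image paths in the order they appear."""
    parts: List[str] = []
    images: List[str] = []
    # Each demonstration is a fully answered example placed before the query.
    for shot in shots:
        parts.append(
            f"{IMAGE_TOKEN}\nQuestion: {shot.question}\nAnswer: {shot.answer}"
        )
        images.append(shot.image_path)
    # The query uses the same layout but leaves the answer for the model.
    parts.append(f"{IMAGE_TOKEN}\nQuestion: {query_question}\nAnswer:")
    images.append(query_image)
    return "\n\n".join(parts), images


# Illustrative usage with made-up file names.
shots = [
    Shot("cat.jpg", "What animal is this?", "cat"),
    Shot("dog.jpg", "What animal is this?", "dog"),
]
prompt, images = build_icl_prompt(shots, "bird.jpg", "What animal is this?")
print(prompt)
```

The model is expected to infer the task (here, naming the animal) from the two answered demonstrations alone, without any weight update.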

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques