What Makes Multimodal In-Context Learning Work?

Folco Bertini Baldassini; Mustafa Shukor; Matthieu Cord; Laure Soulier; Benjamin Piwowarski

arXiv:2404.15736 · cs.CV · April 26, 2024 · 1 citation

TL;DR

This paper investigates how well multimodal in-context learning works in large multimodal models. It finds that the approach depends mainly on text and is often no better than a simple majority-voting strategy, and it identifies several biases and limitations.

Contribution

The paper provides a comprehensive analysis of multimodal in-context learning. It shows that the approach relies on text and that advanced strategies such as RICES do no better than simple majority voting over the context examples.

Findings

01

M-ICL mainly relies on text-driven mechanisms.

02

Advanced-ICL strategies do not outperform majority voting.

03

M-ICL has several biases and limitations that warrant consideration before deployment.

Abstract

Large Language Models have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, such as through In-Context Learning (ICL) with minimal demonstration examples. In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the best open-source multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal tasks. Our study unveils several noteworthy findings: (1) M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality. (2) When used with advanced-ICL strategy (like RICES), M-ICL is not better than a simple strategy based on majority voting over context examples. Moreover, we identify several biases and limitations of M-ICL that warrant consideration prior to deployment. Code…
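The majority-voting baseline in finding (2) can be illustrated with a minimal sketch. This is not the authors' code. It assumes that demonstrations are retrieved by cosine similarity over precomputed embeddings (a RICES-style selection) and that each demonstration carries one categorical label. The function names and the toy two-dimensional embeddings are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_emb, pool, k):
    """Return the k (embedding, label) pairs from `pool` most similar to the query.

    Assumption: embeddings are precomputed, e.g. by an image encoder.
    """
    ranked = sorted(pool, key=lambda ex: cosine(query_emb, ex[0]), reverse=True)
    return ranked[:k]


def majority_vote(demos):
    """Predict the most frequent label among the retrieved demonstrations.

    No model is queried: the prediction uses only the demonstration labels.
    """
    counts = Counter(label for _, label in demos)
    return counts.most_common(1)[0][0]


# Toy data (hypothetical): 2-D embeddings with class labels.
pool = [
    ([1.0, 0.0], "cat"),
    ([0.9, 0.1], "cat"),
    ([0.0, 1.0], "dog"),
    ([0.1, 0.9], "dog"),
    ([0.8, 0.2], "dog"),
]
demos = retrieve_top_k([1.0, 0.0], pool, k=3)
prediction = majority_vote(demos)
```

Here the three nearest demonstrations carry the labels cat, cat, and dog, so the baseline predicts "cat" without ever querying the multimodal model. That is what makes it a useful reference point for retrieval-based M-ICL.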

Code & Models

Repositories

https://gitlab.com/folbaeni/multimodal-icl
PyTorch · Official

Taxonomy

Topics: EFL/ESL Teaching and Learning · Digital Storytelling and Education · Discourse Analysis in Language Studies