From Introspection to Best Practices: Principled Analysis of Demonstrations in Multimodal In-Context Learning

Nan Xu; Fei Wang; Sheng Zhang; Hoifung Poon; Muhao Chen

arXiv:2407.00902·cs.CV·February 10, 2025

TL;DR

This paper systematically analyzes how and why in-context learning works in multimodal large language models, showing that the visual and textual modalities matter differently across tasks and proposing modality-driven demonstration strategies that improve performance.

Contribution

It provides a principled evaluation of multimodal ICL, uncovers task-specific modality effects, and offers demonstration strategies that improve model performance according to which modality matters most for each task.

Findings

01

Each modality's impact differs across tasks in multimodal ICL.

02

Modality-driven demonstration strategies improve ICL performance.

03

Models may follow inductive biases introduced by multimodal ICL demonstrations, even when those biases contradict priors from pretraining.

Abstract

Motivated by the in-context learning (ICL) capabilities of Large Language Models (LLMs), multimodal LLMs with an additional visual modality have been shown to exhibit similar ICL abilities when multiple image-text pairs are provided as demonstrations. However, relatively little work has investigated the principles behind how and why multimodal ICL works. We conduct a systematic and principled evaluation of multimodal ICL for models of different scales on a broad spectrum of new yet critical tasks. Through perturbations over different modality information, we show that modalities matter differently across tasks in multimodal ICL. Guided by task-specific modality impact, we recommend modality-driven demonstration strategies to boost ICL performance. We also find that models may follow inductive biases from multimodal ICL even if they are rarely seen in or contradict semantic priors from…
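The abstract does not spell out how the modality perturbations are constructed, so the following is only a minimal sketch of one plausible variant: breaking the pairing of a single modality across demonstrations while leaving the other fields intact. The `Demo` class, the `perturb_modality` function, and the example file names are hypothetical illustrations, not the authors' implementation.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Demo:
    """One in-context demonstration (hypothetical structure)."""
    image: str   # placeholder image identifier, e.g. a file name
    text: str    # question or instruction paired with the image
    answer: str  # demonstration label / response


def perturb_modality(demos: List[Demo], modality: str) -> List[Demo]:
    """Mismatch one modality by rotating its values across demonstrations.

    Rotation by one position guarantees every demonstration receives a
    value from a different demonstration (given at least two demos with
    distinct values), while the other fields stay untouched.
    """
    if modality not in ("image", "text"):
        raise ValueError("modality must be 'image' or 'text'")
    values = [getattr(d, modality) for d in demos]
    rotated = values[1:] + values[:1]
    return [replace(d, **{modality: v}) for d, v in zip(demos, rotated)]


if __name__ == "__main__":
    demos = [
        Demo("cat.jpg", "What animal is shown?", "cat"),
        Demo("bus.jpg", "What vehicle is shown?", "bus"),
        Demo("pear.jpg", "What fruit is shown?", "pear"),
    ]
    for d in perturb_modality(demos, "image"):
        print(d)
```

Under this assumption, comparing a model's accuracy with clean demonstrations against its accuracy with image-perturbed or text-perturbed demonstrations would give a rough indication of how much each modality matters for a given task.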

Taxonomy

Topics: Discourse Analysis in Language Studies