VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning
Yongshuo Zong, Ondrej Bohdal, Timothy Hospedales

TL;DR
This paper introduces VL-ICL Bench, a comprehensive benchmark for evaluating multimodal in-context learning in vision-language models across diverse tasks, revealing their strengths and limitations and guiding future improvements.
Contribution
The paper presents a new benchmark for multimodal ICL, covering a wide range of tasks and challenges, and evaluates state-of-the-art VLLMs to identify their capabilities and gaps.
Findings
Even advanced models like GPT-4 struggle with the tasks.
VLLMs show diverse strengths and weaknesses across tasks.
The benchmark highlights limitations in current multimodal ICL capabilities.
Abstract
Large language models (LLMs) famously exhibit emergent in-context learning (ICL): the ability to rapidly adapt to new tasks using few-shot examples provided as a prompt, without updating the model's weights. Built on top of LLMs, vision large language models (VLLMs) have advanced significantly in areas such as recognition, reasoning, and grounding. However, investigations into multimodal ICL have predominantly focused on few-shot visual question answering (VQA) and image captioning, which we will show neither exploit the strengths of ICL nor test its limitations. The broader capabilities and limitations of multimodal ICL remain under-explored. In this study, we introduce a comprehensive benchmark, VL-ICL Bench, for multimodal in-context learning, encompassing a broad spectrum of tasks that involve both images and text as inputs and outputs, and different types of challenges,…
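As a rough illustration of the few-shot setup the abstract describes, the sketch below assembles an interleaved image-text prompt from a handful of support examples followed by a query image. It is not taken from the paper or its code: the `Shot` class, the `build_icl_prompt` function, and the segment dictionary layout are all hypothetical names chosen for this example, and a real VLLM would need its own prompt template.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    """One support example (hypothetical structure, not from the paper)."""
    image: str  # path or identifier of the support image
    text: str   # the label or answer paired with that image


def build_icl_prompt(support, query_image, instruction=""):
    """Interleave support (image, text) pairs, then append the query image.

    Returns a list of segments; no model weights are involved, the
    "learning" comes only from the examples placed in the prompt.
    """
    segments = []
    if instruction:
        segments.append({"type": "text", "content": instruction})
    for shot in support:
        segments.append({"type": "image", "content": shot.image})
        segments.append({"type": "text", "content": shot.text})
    # The query image comes last; the model is expected to complete its text.
    segments.append({"type": "image", "content": query_image})
    return segments


if __name__ == "__main__":
    shots = [Shot("a.png", "dax"), Shot("b.png", "blicket")]
    for segment in build_icl_prompt(shots, "q.png", "Induce the mapping."):
        print(segment)
```

Varying the number of entries in `support` corresponds to varying the number of shots, which is the axis along which few-shot evaluations are usually plotted.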
Peer Reviews
Decision: ICLR 2025 Poster
1. The paper starts from a good motivation for evaluating the ICL ability of current multimodal models. 2. This paper proposes the VL-ICL Bench, covering 10 tasks to evaluate diverse capabilities such as perception, reasoning, and rule induction. 3. The authors conduct extensive and thorough experiments with current multimodal models on the proposed benchmark.
1. Some details about the construction of VL-ICL should be clarified. For the datasets used in Table 1, do you use all the samples from the original sources? Do you apply any filtering strategies? 2. Could the authors give more explanation of the ICL efficiency metric? Why does the ICL efficiency have negative numbers in Table 2? 3. Based on the curve figures (e.g., Figure 5 and Figure 6), it appears that the multimodal models do not significantly benefit from additional ICL examples, with perf
+ The overall paper is well organized and easy to follow. + The research problem (i.e., benchmarking the multimodal ICL capabilities of VLLMs) is quite valuable in VLLM communities. Comprehensive analysis experiments are provided to show the limitations of existing benchmarks. + The introduced benchmark covers multiple practical ICL tasks and assesses numerous VLLMs. Multiple discussions are provided to show the promising direction for future research.
- Some text is not consistent with the figures. For instance, in Lines 159-160 and 224-225, the authors claim that ICL exhibits more significant improvement on text-to-text benchmarks than on image-to-text benchmarks. However, the differences between Figures 3a and 4 are marginal. These line charts show similar trends. - Some design choices are not clear. The authors want to show different trends of VLLMs on multimodal and LLM benchmarks in Figure 3 and 4. I am wondering why different m
1. Point out the limitations of the common practice of quantitatively evaluating VLLM ICL through VQA and image captioning. 2. Propose a comprehensive benchmark suite of ICL tasks covering diverse challenges, including perception, reasoning, and so on. 3. It rigorously evaluates a range of state-of-the-art VLLMs on the benchmark suite and highlights their diverse strengths and weaknesses.
1. The evaluation seems too weak. For example, for the ICL tasks of image generation, the community might focus more on generating images based on complex instructions. For example, researchers in such fields prefer using VLMs to evaluate the generated images given complex instructions as described in [1]. 2. The possible usage of this model is still unclear. There are two situations when we want to evaluate multi-modal tasks, but I think the proposed benchmark is not suitable for any situation.
Taxonomy
Topics: Interpreting and Communication in Healthcare · EFL/ESL Teaching and Learning
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Multi-Head Attention · Softmax · Dropout · Byte Pair Encoding · Absolute Position Encodings · Residual Connection · Position-Wise Feed-Forward Layer
