FewMMBench: A Benchmark for Multimodal Few-Shot Learning

Mustafa Dogan; Ilker Kesen; Iacer Calixto; Aykut Erdem; Erkut Erdem

arXiv:2602.21854·cs.CL·February 26, 2026

Open Access · 1 dataset

TL;DR

FewMMBench is a benchmark for evaluating the few-shot learning abilities of multimodal large language models across diverse tasks, comparing their performance across prompting strategies and model families.

Contribution

Introduces FewMMBench, a benchmark for assessing multimodal LLMs in few-shot scenarios across diverse tasks and prompting methods, and evaluates 26 open-weight models from six model families.

Findings

01. Instruction-tuned models perform well zero-shot but show limited or negative gains with few-shot or CoT prompts.

02. Retrieval-based demonstrations and larger context sizes provide minimal improvements.

03. FewMMBench serves as a rigorous tool for diagnosing and improving multimodal few-shot learning.

Abstract

As multimodal large language models (MLLMs) advance in handling interleaved image-text data, assessing their few-shot learning capabilities remains an open challenge. In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting. Covering a diverse suite of multimodal understanding tasks, from attribute recognition to temporal reasoning, FewMMBench enables systematic analysis across task types, model families, and prompting strategies. We evaluate 26 open-weight MLLMs from six model families across zero-shot, few-shot, and CoT-augmented few-shot settings. Our findings reveal that instruction-tuned models exhibit strong zero-shot performance but benefit minimally, or even regress, with additional demonstrations or CoT reasoning. Retrieval-based…
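To make the three evaluation settings named in the abstract concrete, the following is a minimal sketch of how an interleaved image-text prompt might be assembled for zero-shot, few-shot, and CoT-augmented few-shot evaluation. It is not the benchmark's actual prompt format: the `Example` class, the `build_prompt` function, the segment dictionaries, and the "Question/Reasoning/Answer" template are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Example:
    """One benchmark item (hypothetical schema, not the dataset's real fields)."""
    image: str                       # image path or identifier
    question: str
    answer: str
    rationale: Optional[str] = None  # only used in the CoT setting


def build_prompt(query: Example, demos: List[Example],
                 use_cot: bool = False) -> List[Dict[str, str]]:
    """Assemble an interleaved image-text prompt as a list of segments.

    - zero-shot:        demos is empty
    - few-shot (ICL):   demos holds k solved examples
    - CoT few-shot:     as above, with each demo's rationale shown
    """
    segments: List[Dict[str, str]] = []
    for demo in demos:
        segments.append({"type": "image", "value": demo.image})
        text = f"Question: {demo.question}\n"
        if use_cot and demo.rationale:
            text += f"Reasoning: {demo.rationale}\n"
        text += f"Answer: {demo.answer}"
        segments.append({"type": "text", "value": text})

    # The query comes last; its answer is left for the model to produce.
    segments.append({"type": "image", "value": query.image})
    cue = "Reasoning:" if use_cot else "Answer:"
    segments.append({"type": "text", "value": f"Question: {query.question}\n{cue}"})
    return segments


demos = [
    Example("img_a.jpg", "What color is the car?", "red",
            rationale="The car in the foreground is painted red."),
    Example("img_b.jpg", "What color is the door?", "blue",
            rationale="The door on the left is blue."),
]
query = Example("img_q.jpg", "What color is the bench?", "")

zero_shot = build_prompt(query, [])
few_shot = build_prompt(query, demos)
cot_few_shot = build_prompt(query, demos, use_cot=True)
```

Keeping the prompt as a list of typed segments, rather than a single string, leaves the choice of image placeholder tokens to whichever model-specific formatter consumes it.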

Code & Models

Datasets

mustafaa/FewMMBench
dataset · 75 downloads

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Topic Modeling