Large Multimodal Models as General In-Context Classifiers

Marco Garosi; Matteo Farina; Alessandro Conti; Massimiliano Mancini; Elisa Ricci

arXiv:2602.23229·cs.CV·February 27, 2026

TL;DR

This paper shows that large multimodal models (LMMs) are effective in-context classifiers: given a few in-context examples, they can match or surpass contrastive vision-language models (VLMs), especially in open-world scenarios. The proposed CIRCLE method further improves their robustness.

Contribution

The study reveals the in-context learning ability of LMMs for classification and introduces CIRCLE, a training-free method to improve open-world classification performance.

Findings

01. LMMs with in-context examples can match or surpass contrastive VLMs.

02. CIRCLE improves LMMs' robustness in open-world classification.

03. LMMs are viable as unified classifiers and flexible alternatives to specialized models.

Abstract

Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), due to their remarkable performance in zero-shot classification. In contrast, Large Multimodal Models (LMMs) are more suitable for complex tasks. In this work, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP's, LMMs with a few in-context examples can match or even surpass contrastive VLMs with cache-based adapters, their "in-context" equivalent. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them more suitable for the task. In this challenging scenario, LMMs struggle whenever provided with…
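To make the "cache-based adapter" baseline mentioned in the abstract concrete, here is a minimal sketch of how such an adapter can combine zero-shot similarities with a few labeled examples. It assumes a Tip-Adapter-style formulation, which may differ from the adapters actually benchmarked in the paper. The function names, the toy 2-D embeddings, and the `alpha` and `beta` values are illustrative assumptions, not taken from the source.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cache_adapter_logits(query, class_text_embs, cache_keys, cache_labels,
                         alpha=1.0, beta=5.5):
    """Assumed Tip-Adapter-style scoring (hypothetical helper, not the paper's code).

    query           : image embedding of the test sample
    class_text_embs : one text embedding per class (the zero-shot classifier)
    cache_keys      : image embeddings of the few labeled examples
    cache_labels    : class index of each cached example
    """
    q = normalize(query)
    # Zero-shot term: cosine similarity between the image and each class prompt.
    logits = [dot(q, normalize(t)) for t in class_text_embs]
    # Cache term: each labeled example votes for its class, weighted by how
    # close it is to the query (weight is 1 when identical, decaying with distance).
    for key, label in zip(cache_keys, cache_labels):
        affinity = math.exp(-beta * (1.0 - dot(q, normalize(key))))
        logits[label] += alpha * affinity
    return logits


def predict(query, class_text_embs, cache_keys, cache_labels):
    logits = cache_adapter_logits(query, class_text_embs, cache_keys, cache_labels)
    return max(range(len(logits)), key=logits.__getitem__)


# Toy data: two classes in a 2-D embedding space.
text_embs = [[1.0, 0.0], [0.0, 1.0]]
query = [0.6, 0.8]

zero_shot_pred = predict(query, text_embs, [], [])               # no cached examples
few_shot_pred = predict(query, text_embs, [[0.6, 0.8]], [0])     # one example of class 0
```

In this toy setup the zero-shot term alone favors class 1, while a single cached example of class 0 near the query shifts the prediction to class 0. This plays the same role for a contrastive VLM that in-context examples play for an LMM: both adapt the classifier with a few labeled samples and no gradient updates.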

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling