Large Multimodal Models as General In-Context Classifiers
Marco Garosi, Matteo Farina, Alessandro Conti, Massimiliano Mancini, Elisa Ricci

TL;DR
This paper demonstrates that large multimodal models (LMMs) can perform in-context classification effectively, often surpassing contrastive vision-language models (VLMs), especially in open-world scenarios. The proposed CIRCLE method further improves their robustness.
Contribution
The study reveals the in-context learning ability of LMMs for classification and introduces CIRCLE, a training-free method to improve open-world classification performance.
Findings
LMMs with in-context examples can match or surpass contrastive VLMs.
CIRCLE improves LMMs' robustness in open-world classification.
LMMs are viable as unified classifiers and flexible alternatives to specialized models.
Abstract
Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), due to their remarkable performance in zero-shot classification. In contrast, Large Multimodal Models (LMMs) are more suitable for complex tasks. In this work, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP's, LMMs with a few in-context examples can match or even surpass contrastive VLMs with cache-based adapters, their "in-context" equivalent. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them more suitable for the task. In this challenging scenario, LMMs struggle whenever provided with…
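To make the "cache-based adapter" baseline mentioned in the abstract more concrete, the following is a minimal sketch of a training-free cache adapter in the style of Tip-Adapter. It is an illustration under stated assumptions, not the paper's actual baseline: the abstract does not say which adapter is used, and the function name `cache_adapter_logits`, the hyperparameters `alpha` and `beta`, and the toy embeddings are all hypothetical. The idea is that few-shot image embeddings act as keys and their labels as values, and the resulting cache logits are added to the usual zero-shot logits.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cache_adapter_logits(query, class_text_emb, cache_keys, cache_labels,
                         num_classes, alpha=1.0, beta=5.5):
    """Combine zero-shot logits with a few-shot key-value cache (no training).

    query:          (Q, D) image embeddings to classify
    class_text_emb: (C, D) text embeddings, one per class name
    cache_keys:     (N, D) image embeddings of the few-shot examples
    cache_labels:   (N,)   integer labels of the few-shot examples
    alpha, beta:    assumed hyperparameters (cache weight, sharpness)
    """
    query = l2_normalize(query)
    class_text_emb = l2_normalize(class_text_emb)
    cache_keys = l2_normalize(cache_keys)

    # Zero-shot term: scaled cosine similarity to each class text embedding.
    zero_shot = 100.0 * query @ class_text_emb.T            # (Q, C)

    # Cache term: similarity to each stored example, sharpened by beta,
    # then routed to that example's class through one-hot labels.
    affinity = query @ cache_keys.T                         # (Q, N)
    weights = np.exp(-beta * (1.0 - affinity))              # (Q, N)
    one_hot = np.eye(num_classes)[cache_labels]             # (N, C)
    cache_logits = weights @ one_hot                        # (Q, C)

    return zero_shot + alpha * cache_logits

# Toy data (made up): the text embeddings are identical, so the zero-shot
# term cannot separate the classes and only the cache decides.
class_text_emb = np.array([[0.0, 0.0, 1.0, 0.0],
                           [0.0, 0.0, 1.0, 0.0]])
cache_keys = np.array([[1.0, 0.0, 0.0, 0.0],    # example of class 0
                       [0.0, 1.0, 0.0, 0.0]])   # example of class 1
cache_labels = np.array([0, 1])
query = np.array([[1.0, 0.1, 0.0, 0.0]])        # close to the class-0 example

logits = cache_adapter_logits(query, class_text_emb, cache_keys,
                              cache_labels, num_classes=2)
prediction = int(np.argmax(logits, axis=1)[0])
```

In this sketch the few-shot examples play the role that in-context examples play for an LMM: both condition the prediction on a handful of labeled images without updating any model weights.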
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
