Multimodal Large Language Models as Image Classifiers

Nikita Kisel; Illia Volkov; Klara Janouskova; Jiri Matas

arXiv:2603.06578·cs.CV·March 10, 2026

Open Access

TL;DR

This paper examines the evaluation protocols used to assess Multimodal Large Language Models (MLLMs) as image classifiers, identifies issues that distort measured performance, and shows that corrected protocols and higher-quality labels yield higher measured accuracy and point to practical uses such as assisting dataset curation.

Contribution

The paper identifies flaws in current evaluation protocols for MLLMs, proposes fixes, and shows that improved evaluation and label quality reveal better performance and potential applications.

Findings

01. Corrected evaluation protocols increase MLLM accuracy by up to 10.8%.

02. Flawed evaluation protocols have led to underestimating MLLMs' true capabilities.

03. MLLMs can effectively assist human annotators in dataset curation.

Abstract

The classification performance of Multimodal Large Language Models (MLLMs) depends critically on the evaluation protocol and on ground-truth quality. Studies comparing MLLMs with supervised and vision-language models report conflicting conclusions, and we show these conflicts stem from protocols that either inflate or underestimate performance. Across the most common evaluation protocols, we identify and fix key issues: model outputs that fall outside the provided class list and are discarded, inflated results from weak multiple-choice distractors, and an open-world setting that underperforms only due to poor output mapping. We additionally quantify the impact of commonly overlooked design choices (batch size, image ordering, and text encoder selection), showing they substantially affect accuracy. Evaluating on ReGT, our multilabel reannotation of 625 ImageNet-1k classes, reveals that MLLMs benefit…
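To make the out-of-list issue mentioned in the abstract concrete, here is a minimal sketch of how three scoring policies can produce different accuracies from the same model outputs. It is an illustration under stated assumptions, not the paper's protocol: the helper names (`normalize`, `map_to_class`, `accuracy`), the toy data, and the use of `difflib` string similarity as a stand-in for a text-encoder-based mapping are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(text.lower().split())


def map_to_class(output: str, class_names: list[str]) -> str:
    """Hypothetical mapping: pick the class name most similar to the raw output.

    String similarity is an assumption made for this sketch; it stands in for
    whatever output-mapping method an evaluation actually uses.
    """
    cleaned = normalize(output)
    return max(
        class_names,
        key=lambda name: SequenceMatcher(None, cleaned, normalize(name)).ratio(),
    )


def accuracy(outputs, labels, class_names, policy: str) -> float:
    """Score outputs under one of three policies for out-of-list answers.

    'discard'     : drop out-of-list outputs from the evaluation entirely
    'count_wrong' : keep them and score them as errors
    'map'         : map them to the most similar class name, then score
    """
    by_normalized_name = {normalize(name): name for name in class_names}
    correct = 0
    scored = 0
    for output, label in zip(outputs, labels):
        prediction = by_normalized_name.get(normalize(output))
        if prediction is None:  # output is not an exact class name
            if policy == "discard":
                continue
            if policy == "map":
                prediction = map_to_class(output, class_names)
        scored += 1
        correct += prediction == label
    return correct / scored if scored else 0.0


# Toy data (invented for illustration only).
class_names = ["tabby cat", "golden retriever", "sports car"]
outputs = ["tabby cat", "a golden retriever", "sports car", "golden retriever puppy"]
labels = ["tabby cat", "golden retriever", "sports car", "tabby cat"]

for policy in ("discard", "count_wrong", "map"):
    print(policy, accuracy(outputs, labels, class_names, policy))
```

On this toy data, discarding the two out-of-list outputs scores only the easy cases, counting them as wrong penalizes an answer that is correct apart from its phrasing, and mapping sits in between. The gap between the three numbers is an artifact of the invented examples and says nothing about the magnitudes reported in the paper.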

Taxonomy

Topics: Multimodal Machine Learning Applications · Computational and Text Analysis Methods · Topic Modeling