Revisiting MLLMs: An In-Depth Analysis of Image Classification Abilities
Huan Liu, Lingyu Xiao, Jiangjiang Liu, Xiaofan Li, Ze Feng, Sen Yang, Jingdong Wang

TL;DR
This paper evaluates the image classification abilities of Multimodal Large Language Models (MLLMs) and finds that the most recent models can match or even outperform CLIP-style vision-language models on several datasets, challenging the earlier assumption that MLLMs are poor at image classification.
Contribution
It provides an in-depth analysis of MLLMs' image classification performance, identifying key factors like architecture and training data that contribute to their success.
Findings
MLLMs match or outperform CLIP-style models on several datasets.
Advancements in language models and training data diversity drive improvements.
Analysis attributes success to conceptual knowledge transfer and exposure to target concepts.
Abstract
With the rapid advancement of Multimodal Large Language Models (MLLMs), a variety of benchmarks have been introduced to evaluate their capabilities. While most evaluations have focused on complex tasks such as scientific comprehension and visual reasoning, little attention has been given to assessing their fundamental image classification abilities. In this paper, we address this gap by thoroughly revisiting the MLLMs with an in-depth analysis of image classification. Specifically, building on established datasets, we examine a broad spectrum of scenarios, from general classification tasks (e.g., ImageNet, ObjectNet) to more fine-grained categories such as bird and food classification. Our findings reveal that the most recent MLLMs can match or even outperform CLIP-style vision-language models on several datasets, challenging the previous assumption that MLLMs are bad at image…
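To make the comparison in the abstract concrete, the sketch below shows one way the two model families could be scored on the same labels. It is a minimal illustration under stated assumptions, not the paper's evaluation protocol. The CLIP-style path picks the class whose text embedding is most similar to the image embedding. The MLLM path maps a free-form text answer back to a class name. All function names, the substring-matching rule, the temperature value, and the toy embeddings and answers are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def softmax(scores, temperature=0.01):
    """Numerically stable softmax; the temperature value is an assumption."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def clip_style_predict(image_emb, class_embs, class_names):
    """Return the class whose text embedding best matches the image embedding."""
    probs = softmax([cosine(image_emb, c) for c in class_embs])
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best]

def mllm_style_predict(answer_text, class_names):
    """Map a free-form answer to a class by case-insensitive substring match.

    Hypothetical rule: the longest matching class name wins, and None is
    returned when no class name appears. Plain substring matching can
    misfire (e.g. "cat" inside "category"), so a real evaluation would
    need a stricter matcher.
    """
    text = answer_text.lower()
    matches = [name for name in class_names if name.lower() in text]
    return max(matches, key=len) if matches else None

def top1_accuracy(predictions, labels):
    """Fraction of predictions that equal the ground-truth label."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Toy inputs, invented for illustration only (not results from the paper).
class_names = ["cat", "dog", "bird"]
class_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_embs = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1]]
answers = ["This is a cat.", "The image shows a dog."]
labels = ["cat", "dog"]

clip_preds = [clip_style_predict(e, class_embs, class_names) for e in image_embs]
mllm_preds = [mllm_style_predict(a, class_names) for a in answers]
print("CLIP-style top-1:", top1_accuracy(clip_preds, labels))
print("MLLM-style top-1:", top1_accuracy(mllm_preds, labels))
```

The answer-matching step is the main design choice here: a CLIP-style model always returns one of the candidate classes, whereas a generative model can reply with text that names no candidate, which this sketch counts as an error.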
Taxonomy
Topics: Educational Assessment and Pedagogy
Methods: Softmax · Attention Is All You Need
