Revisiting MLLMs: An In-Depth Analysis of Image Classification Abilities

Huan Liu; Lingyu Xiao; Jiangjiang Liu; Xiaofan Li; Ze Feng; Sen Yang; Jingdong Wang

arXiv:2412.16418·cs.CV·December 24, 2024

Open Access

TL;DR

This paper evaluates the image classification abilities of Multimodal Large Language Models (MLLMs) and finds that the most recent ones can match or even outperform CLIP-style vision-language models on several datasets, challenging the earlier assumption that MLLMs perform poorly at this task.

Contribution

It provides an in-depth analysis of MLLMs' image classification performance, identifying key factors like architecture and training data that contribute to their success.

Findings

01. MLLMs match or outperform CLIP-style models on several datasets.

02. Advancements in language models and training data diversity drive improvements.

03. Analysis attributes success to conceptual knowledge transfer and exposure to target concepts.
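
For readers unfamiliar with the CLIP-style baseline named in the findings above, the following is a minimal sketch of how such models are commonly used for zero-shot classification: the image embedding is compared with one text embedding per class name by cosine similarity, the similarities are scaled and passed through a softmax, and the highest-scoring class is the prediction. It is not the paper's evaluation code. The function names, the temperature value, and the three-dimensional embeddings are all hypothetical and chosen only for illustration; a real model would produce the embeddings with its image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_emb, class_embs, temperature=0.01):
    """Return (predicted_label, {label: probability}).

    class_embs maps each class name to its text embedding.
    temperature is an assumed scaling constant, not a value from the paper.
    """
    labels = list(class_embs)
    logits = [cosine(image_emb, class_embs[name]) / temperature for name in labels]
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], dict(zip(labels, probs))


def top1_accuracy(predictions, targets):
    """Fraction of predictions that exactly match the target label."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Made-up embeddings standing in for encoder outputs (illustration only).
class_embs = {
    "bird": [1.0, 0.0, 0.0],
    "food": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}
images = [
    ([0.9, 0.1, 0.0], "bird"),
    ([0.1, 0.8, 0.2], "food"),
    ([0.5, 0.6, 0.1], "bird"),  # closer to "food", so it is misclassified
]

preds = [zero_shot_classify(emb, class_embs)[0] for emb, _ in images]
accuracy = top1_accuracy(preds, [target for _, target in images])
print(preds, accuracy)
```

An MLLM has no fixed set of class embeddings to score against; it answers in text, so its output must be mapped to a label before the same top-1 accuracy can be computed. How the paper performs that mapping is not stated in this summary.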

Abstract

With the rapid advancement of Multimodal Large Language Models (MLLMs), a variety of benchmarks have been introduced to evaluate their capabilities. While most evaluations have focused on complex tasks such as scientific comprehension and visual reasoning, little attention has been given to assessing their fundamental image classification abilities. In this paper, we address this gap by thoroughly revisiting the MLLMs with an in-depth analysis of image classification. Specifically, building on established datasets, we examine a broad spectrum of scenarios, from general classification tasks (e.g., ImageNet, ObjectNet) to more fine-grained categories such as bird and food classification. Our findings reveal that the most recent MLLMs can match or even outperform CLIP-style vision-language models on several datasets, challenging the previous assumption that MLLMs are bad at image…

Taxonomy

Topics: Educational Assessment and Pedagogy

Methods: Softmax · Attention Is All You Need