What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models

Abdelrahman Abdelhamed; Mahmoud Afifi; Alec Go

arXiv:2405.15668·cs.CV·June 27, 2025

TL;DR

This paper introduces a straightforward approach using multimodal large language models to improve zero-shot image classification accuracy across multiple datasets without dataset-specific prompt engineering.

Contribution

The paper presents a method that uses multimodal LLMs for zero-shot image classification and surpasses prior benchmark accuracy without dataset-specific prompts.

Findings

01

Achieved an average accuracy gain of 6.2 percentage points across ten benchmarks.

02

Surpassed benchmark accuracy on multiple datasets, including a 6.8 point increase on ImageNet.

03

Demonstrated the effectiveness of multimodal LLMs in enhancing zero-shot vision tasks.

Abstract

Large language models (LLMs) have been effectively used for many computer vision tasks, including image classification. In this paper, we present a simple yet effective approach for zero-shot image classification using multimodal LLMs. Using multimodal LLMs, we generate comprehensive textual representations from input images. These textual representations are then utilized to generate fixed-dimensional features in a cross-modal embedding space. Subsequently, these features are fused together to perform zero-shot classification using a linear classifier. Our method does not require prompt engineering for each dataset; instead, we use a single, straightforward set of prompts across all datasets. We evaluated our method on several datasets and our results demonstrate its remarkable effectiveness, surpassing benchmark accuracy on multiple datasets. On average, for ten benchmarks, our method…
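The pipeline described in the abstract (per-image features in a shared embedding space, fusion, then a linear classifier) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the 3-D vectors are hypothetical stand-ins for real encoder outputs, the three feature names follow the reviewers' description (image, generated description, initial class prediction), mean fusion is chosen because a reviewer reports that averaging performed best in the ablation, and using one weight vector per class name is an assumption borrowed from common CLIP-style zero-shot setups.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returns a copy if the norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_by_mean(features):
    """Average several equal-length feature vectors, then re-normalize."""
    normalized = [l2_normalize(f) for f in features]
    dim = len(normalized[0])
    mean = [sum(f[i] for f in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def zero_shot_classify(query, class_weights):
    """Linear classifier: one dot product per class, argmax over the scores."""
    scores = {
        name: sum(q * w for q, w in zip(query, l2_normalize(weights)))
        for name, weights in class_weights.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy embeddings; a real system would obtain these from a
# cross-modal encoder applied to the image and to the LLM-generated text.
image_feat = [0.9, 0.1, 0.0]
description_feat = [0.8, 0.3, 0.1]
initial_prediction_feat = [0.7, 0.2, 0.2]

# Assumed classifier weights: one embedding per class name.
class_weights = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}

query = fuse_by_mean([image_feat, description_feat, initial_prediction_feat])
label, scores = zero_shot_classify(query, class_weights)
print(label)
```

Because every feature is normalized before averaging, no single input dominates the fused query merely by having a larger magnitude; other fusion choices (for example concatenation) would need a differently shaped classifier.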

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 5 · Confidence 2

Strengths

The proposed method introduces a multimodal LLM-based framework to improve zero-shot image classification by integrating image descriptions and initial class predictions, which is relatively novel and improves on the limitations of using visual features alone.

Weaknesses

The method’s reliance on a multimodal LLM (e.g., Gemini) may restrict its flexibility, as it depends on the capabilities and availability of specific LLMs, which could become a bottleneck.

Reviewer 02 · Rating 3 · Confidence 4

Strengths

- It is reasonable to utilize pretrained foundation models to solve downstream tasks.
- Leveraging generated textual description in visual recognition is a research theme in recent years.

Weaknesses

- As the paper concerns "enhancing zero-shot image classification with multimodal large language models" (as suggested in the title), it fails to compare against other zero-shot image classification methods, especially those that leverage pretrained foundation models and their pretraining data, such as [R1-R3].
- The setting is a more serious issue. While datasets used in experiments are standard ones for zero-shot recognition in the literature, images in these datasets are collected on the web, from w

Reviewer 03 · Rating 6 · Confidence 5

Strengths

- The paper is written very well, effectively communicating the method description and experiments in a concise and easy to understand way.
- The proposed approach shows significant improvement on almost every dataset traditionally used for benchmarking zero-shot image classification.
- It appears to be the first work using multimodal capabilities of MLLMs for this task, which is a natural next step from text-based features used in previous LLM methods for this task.

Weaknesses

- The discussion of the results could be improved. For example, lines 413-420 describe the results of the ablation study in Table 3, but there is no discussion to explain them. In particular, line 420 says averaging features leads to the best performance, but it would be better if the authors also provided a likely reason why the other fusion methods do not work.
- In the Table 2 caption, the authors state they used the combined class features for all datasets except Caltech where class des

Reviewer 04 · Rating 5 · Confidence 5

Strengths

The proposed method is simple but effective. The paper is well presented, and the method is described clearly. Ablations of the method have also been conducted for various contributing factors. The combination of various prompting strategies for CLIP-based classification, and the analysis of their effects, is also a useful contribution.

Weaknesses

Missing datasets: Some datasets which are utilized by previous methods such as CUB, FGVC Aircraft, EuroSAT have been overlooked. Is it the case that the multimodal LLM does not possess useful knowledge about these fine-grained datasets and thus cannot improve performance?

Missing baseline: A useful baseline in Table 1 would be the accuracy of the Multimodal LLM in classifying the images. This result will give an insight into the improvements that combining the multimodal LLM and CLIP is bring

Reviewer 05 · Rating 3 · Confidence 5

Strengths

This is a complete work that proposes to enhance classification accuracy by introducing richer textual information generated by LLMs. The proposed method employs a simple, universal set of prompts and avoids dataset-specific prompt engineering. The proposed method is experimentally validated.

Weaknesses

The method introduced in this paper is quite simple. It simply combines the textual and visual inputs to improve accuracy. The work lacks theoretical analysis: why, and by how much, can the proposed method enhance accuracy? The experimental section lacks implementation details.

Code & Models

Repositories

donatoaz/what-do-you-see-zero-shot-image-classification-multimodal-llm


Taxonomy

Topics: Multimodal Machine Learning Applications · COVID-19 diagnosis using AI · Topic Modeling

Methods: Sparse Evolutionary Training