What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models
Abdelrahman Abdelhamed, Mahmoud Afifi, Alec Go

TL;DR
This paper introduces a straightforward approach using multimodal large language models to improve zero-shot image classification accuracy across multiple datasets without dataset-specific prompt engineering.
Contribution
The paper presents a novel method leveraging multimodal LLMs for zero-shot image classification that surpasses previously reported accuracy on existing benchmarks without dataset-specific prompts.
Findings
Achieved an average accuracy gain of 6.2 percentage points across ten benchmarks.
Surpassed benchmark accuracy on multiple datasets, including a 6.8 percentage point increase on ImageNet.
Demonstrated the effectiveness of multimodal LLMs in enhancing zero-shot vision tasks.
Abstract
Large language models (LLMs) have been effectively used for many computer vision tasks, including image classification. In this paper, we present a simple yet effective approach for zero-shot image classification using multimodal LLMs. Using multimodal LLMs, we generate comprehensive textual representations from input images. These textual representations are then utilized to generate fixed-dimensional features in a cross-modal embedding space. Subsequently, these features are fused together to perform zero-shot classification using a linear classifier. Our method does not require prompt engineering for each dataset; instead, we use a single, straightforward set of prompts across all datasets. We evaluated our method on several datasets and our results demonstrate its remarkable effectiveness, surpassing benchmark accuracy on multiple datasets. On average, for ten benchmarks, our method…
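The pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the multimodal LLM and the cross-modal encoder are replaced by hand-written stand-in vectors, all function and variable names are hypothetical, and averaging is assumed as the fusion step because a reviewer reports it as the best-performing option in the ablation. The choice of three input features (image, generated description, initial class prediction) follows a reviewer's summary of the method.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def fuse_features(features):
    """Average L2-normalized feature vectors, then re-normalize.

    Assumption: simple averaging stands in for the paper's fusion step.
    """
    stacked = np.stack([l2_normalize(f) for f in features], axis=0)
    return l2_normalize(stacked.mean(axis=0))


def zero_shot_classify(query_feature, class_features):
    """Score a query against class embeddings with a linear classifier.

    The classifier weights are the normalized class embeddings, so each
    score is a cosine similarity. Returns (predicted index, scores).
    """
    weights = l2_normalize(class_features)          # (num_classes, dim)
    scores = weights @ l2_normalize(query_feature)  # (num_classes,)
    return int(np.argmax(scores)), scores


# Stand-in embeddings (hypothetical values). In a real pipeline these would
# come from a cross-modal encoder applied to the image, to the description
# generated by the multimodal LLM, and to the LLM's initial class prediction.
class_features = np.eye(3, 4)                         # 3 classes, dim 4
image_feat = np.array([0.2, 0.9, 0.1, 0.0])
description_feat = np.array([0.1, 0.8, 0.3, 0.1])
initial_pred_feat = np.array([0.0, 1.0, 0.0, 0.0])

fused = fuse_features([image_feat, description_feat, initial_pred_feat])
pred, scores = zero_shot_classify(fused, class_features)
print(pred, scores)
```

Because every feature is normalized before averaging, no single input dominates the fused vector by magnitude alone; whether that matches the paper's exact normalization is an assumption.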
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
The proposed method introduces a multimodal LLM-based framework to improve zero-shot image classification by integrating image descriptions and initial class predictions, which is relatively novel and improves on the limitations of using visual features alone.
The method’s reliance on a multimodal LLM (e.g., Gemini) may restrict its flexibility, as it depends on the capabilities and availability of specific LLMs, which could become a bottleneck.
- It is reasonable to utilize pretrained foundation models to solve downstream tasks.
- Leveraging generated textual descriptions in visual recognition has been a research theme in recent years.
- As the paper concerns "enhancing zero-shot image classification with multimodal large language models" (as suggested in the title), it fails to compare against other zero-shot image classification methods, especially those that leverage pretrained foundation models and their pretraining data, such as [R1-R3].
- The setting is a more serious issue. While the datasets used in the experiments are standard ones for zero-shot recognition in the literature, images in these datasets are collected on the web, from w
- The paper is written very well, effectively communicating the method description and experiments in a concise and easy-to-understand way.
- The proposed approach shows significant improvement on almost every dataset traditionally used for benchmarking zero-shot image classification.
- It appears to be the first work using multimodal capabilities of MLLMs for this task, which is a natural next step from text-based features used in previous LLM methods for this task.
- The discussion of the results could be improved. For example, lines 413-420 describe the results of the ablation study in Table 3, but there is no discussion to explain them. In particular, line 420 says averaging features leads to the best performance, but it would be better if the authors also provided a likely reason why other fusion methods do not work.
- In the Table 2 caption, the authors state they used the combined class features for all datasets except Caltech where class des
The proposed method is simple but effective. The paper is well presented, and the method is described clearly. Ablations of the method have also been conducted for various contributing factors. The combination of various prompting strategies for the CLIP model in classification, and the analysis of their effects, is also a useful contribution.
Missing datasets: Some datasets used by previous methods, such as CUB, FGVC Aircraft, and EuroSAT, have been overlooked. Is it the case that the multimodal LLM does not possess useful knowledge about these fine-grained datasets and thus cannot improve performance?
Missing baseline: A useful baseline in Table 1 would be the accuracy of the multimodal LLM in classifying the images. This result will give an insight into the improvements that combining the multimodal LLM and CLIP is bring
This is a complete work that proposes to enhance classification accuracy by introducing richer textual information generated by LLMs. The proposed method employs a simple and universal set of prompts and is free from dataset-specific prompt engineering. The proposed method is experimentally validated.
The method introduced in this paper is quite simple. It simply combines the textual and visual inputs to improve accuracy. The work lacks theoretical analysis: why, and by how much, can the proposed method enhance accuracy? The experimental section lacks implementation details.
Taxonomy
Topics: Multimodal Machine Learning Applications · COVID-19 diagnosis using AI · Topic Modeling
Methods: Sparse Evolutionary Training
