Why are Visually-Grounded Language Models Bad at Image Classification?
Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, Serena Yeung-Levy

TL;DR
This paper investigates why visually-grounded language models underperform in image classification tasks, identifying data limitations as the main cause, and demonstrates that training with more class-specific data significantly improves their accuracy.
Contribution
The study reveals that data exposure during training is crucial for VLMs' image classification performance and shows how integrating classification data enhances their capabilities.
Findings
VLMs underperform compared to CLIP on ImageNet.
Performance correlates with class exposure during training.
Adding classification data improves VLM accuracy by 11.8%.
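The second finding, that performance correlates with class exposure during training, can be illustrated with a rank correlation between how often each class appears in the training data and the model's accuracy on that class. The sketch below is a minimal illustration, not the authors' analysis code: the choice of Spearman correlation, the function names, and the numbers in `class_frequency` and `class_accuracy` are all assumptions made up for this example and are not taken from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical per-class values, invented for illustration only:
# how many training examples mention each class, and the accuracy on it.
class_frequency = [0, 3, 12, 50, 400]
class_accuracy = [0.05, 0.20, 0.15, 0.60, 0.85]

rho = spearman(class_frequency, class_accuracy)
print(f"Spearman correlation: {rho:.2f}")
```

A value of `rho` close to 1 would mean that classes seen more often during training tend to be classified more accurately, which is the kind of relationship the finding describes.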
Abstract
Image classification is one of the most fundamental capabilities of machine vision intelligence. In this work, we revisit the image classification task using visually-grounded language models (VLMs) such as GPT-4V and LLaVA. We find that existing proprietary and public VLMs, despite often using CLIP as a vision encoder and having many more parameters, significantly underperform CLIP on standard image classification benchmarks like ImageNet. To understand the reason, we explore several hypotheses concerning the inference algorithms, training objectives, and data processing in VLMs. Our analysis reveals that the primary cause is data-related: critical information for image classification is encoded in the VLM's latent space but can only be effectively decoded with enough training data. Specifically, there is a strong correlation between the frequency of class exposure during VLM training…
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Contrastive Language-Image Pre-training
