Rethinking VLMs and LLMs for Image Classification
Avi Cooper, Keizo Kato, Chia-Hsien Shih, Hiroaki Yamane, Kasper Vinken, Kentaro Takemoto, Taro Sunagawa, Hao-Wei Yeh, Jin Yamanaka, Ian Mason, Xavier Boix

TL;DR
This paper evaluates how well Visual Language Models (VLMs) combined with Large Language Models (LLMs) perform at image classification. It finds that VLMs without an LLM often outperform LLM-augmented VLMs on object and scene recognition, while LLMs help on tasks that require reasoning and outside knowledge.
Contribution
The paper introduces a lightweight LLM-based router that directs each visual task to the most suitable model, improving efficiency and accuracy over existing approaches.
Findings
Non-LLM VLMs outperform LLM-augmented VLMs in recognition tasks.
Leveraging LLMs enhances performance on reasoning and knowledge-based tasks.
The proposed lightweight LLM router surpasses or matches state-of-the-art accuracy while being cost-effective.
Abstract
Visual Language Models (VLMs) are increasingly being merged with Large Language Models (LLMs) to enable new capabilities, particularly improved interactivity and open-ended responsiveness. While these are remarkable capabilities, the contribution of LLMs to the longstanding key problem of classifying an image among a set of choices remains unclear. Through extensive experiments involving seven models, ten visual understanding datasets, and multiple prompt variations per dataset, we find that, for object and scene recognition, VLMs that do not leverage LLMs can achieve better performance than VLMs that do. At the same time, leveraging LLMs can improve performance on tasks requiring reasoning and outside knowledge. In response to these challenges, we propose a pragmatic solution: a lightweight fix involving a relatively small LLM that efficiently routes…
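The routing idea can be illustrated with a minimal sketch. This is not the authors' implementation: the model pool, the names `vlm_without_llm` and `vlm_with_llm`, the placeholder classifiers, and the keyword list are all hypothetical, and the keyword check merely stands in for the small LLM that the paper uses to choose a model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Candidate:
    """One model in the pool; `classify` maps (image_path, choices) to a label."""
    name: str
    classify: Callable[[str, List[str]], str]


# Placeholder classifiers (assumptions): real systems would run a VLM here.
def _recognition_stub(image_path: str, choices: List[str]) -> str:
    return choices[0]


def _reasoning_stub(image_path: str, choices: List[str]) -> str:
    return choices[-1]


# Hypothetical pool with one model per family compared in the abstract.
POOL: Dict[str, Candidate] = {
    "vlm_without_llm": Candidate("vlm_without_llm", _recognition_stub),
    "vlm_with_llm": Candidate("vlm_with_llm", _reasoning_stub),
}

# Assumed cues that a prompt needs reasoning or outside knowledge.
REASONING_CUES = ("why", "how many", "what would", "because", "which country")


def route(prompt: str) -> str:
    """Stand-in for the router: return the pool key of the model to use."""
    text = prompt.lower()
    if any(cue in text for cue in REASONING_CUES):
        return "vlm_with_llm"
    return "vlm_without_llm"


def classify(image_path: str, prompt: str, choices: List[str]) -> Tuple[str, str]:
    """Route the task, then let the selected model pick among the choices."""
    model = POOL[route(prompt)]
    return model.name, model.classify(image_path, choices)
```

In this sketch only the selected model runs for each query, which is where the efficiency gain of routing would come from; swapping `route` for a call to a small trained language model would leave the rest of the structure unchanged.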
Taxonomy
Topics: Neural Networks and Applications · Advanced Computational Techniques and Applications
Methods: Sparse Evolutionary Training
