Rethinking VLMs and LLMs for Image Classification

Avi Cooper; Keizo Kato; Chia-Hsien Shih; Hiroaki Yamane; Kasper Vinken; Kentaro Takemoto; Taro Sunagawa; Hao-Wei Yeh; Jin Yamanaka; Ian Mason; Xavier Boix

arXiv:2410.14690·cs.LG·October 22, 2024

TL;DR

This paper evaluates whether pairing Visual Language Models (VLMs) with Large Language Models (LLMs) helps image classification. It finds that VLMs without an LLM often outperform LLM-equipped VLMs on object and scene recognition, while LLMs help on tasks that require reasoning and outside knowledge.

Contribution

The paper introduces a lightweight router, built on a relatively small LLM, that directs each visual task to the most suitable model, improving efficiency and accuracy over existing approaches.

Findings

01

Non-LLM VLMs outperform LLM-augmented VLMs in recognition tasks.

02

Leveraging LLMs enhances performance on reasoning and knowledge-based tasks.

03

The proposed lightweight LLM router surpasses or matches state-of-the-art accuracy while being cost-effective.
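
The routing idea in the findings above can be illustrated with a minimal sketch. This is not the paper's implementation: the paper uses a small LLM to choose the model, whereas the sketch below swaps in a plain keyword heuristic (`keyword_pick`) so that it runs without any model weights. The names `Router`, `keyword_pick`, `REASONING_CUES`, and the model keys `contrastive_vlm` and `llm_vlm` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical model handle: takes (image_path, prompt) and returns a label.
Model = Callable[[str, str], str]

# Hypothetical cue words suggesting a prompt needs reasoning or outside knowledge.
REASONING_CUES = ("why", "explain", "how many", "what would happen")


def keyword_pick(prompt: str) -> str:
    """Stand-in for the small LLM router: pick a model key from the prompt text."""
    text = prompt.lower()
    if any(cue in text for cue in REASONING_CUES):
        return "llm_vlm"          # reasoning / knowledge-style question
    return "contrastive_vlm"      # plain object or scene recognition


@dataclass
class Router:
    """Dispatches each task to one of several candidate models."""
    models: Dict[str, Model]
    pick: Callable[[str], str] = keyword_pick

    def classify(self, image_path: str, prompt: str) -> Tuple[str, str]:
        name = self.pick(prompt)
        if name not in self.models:
            raise KeyError(f"router chose unknown model {name!r}")
        # Return which model was used together with its answer.
        return name, self.models[name](image_path, prompt)


# Dummy models so the sketch runs end to end; real ones would run inference.
router = Router(models={
    "contrastive_vlm": lambda image, prompt: "dog",
    "llm_vlm": lambda image, prompt: "because it is raining",
})
```

In practice the `pick` callable would be replaced by a call to the routing LLM, and the dummy lambdas by real model wrappers; the dispatch structure stays the same.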

Abstract

Visual Language Models (VLMs) are now increasingly being merged with Large Language Models (LLMs) to enable new capabilities, particularly in terms of improved interactivity and open-ended responsiveness. While these are remarkable capabilities, the contribution of LLMs to enhancing the longstanding key problem of classifying an image among a set of choices remains unclear. Through extensive experiments involving seven models, ten visual understanding datasets, and multiple prompt variations per dataset, we find that, for object and scene recognition, VLMs that do not leverage LLMs can achieve better performance than VLMs that do. Yet at the same time, leveraging LLMs can improve performance on tasks requiring reasoning and outside knowledge. In response to these challenges, we propose a pragmatic solution: a lightweight fix involving a relatively small LLM that efficiently routes…

Taxonomy

Topics: Neural Networks and Applications · Advanced Computational Techniques and Applications

Methods: Sparse Evolutionary Training