Enhancing Fine-Grained Image Classifications via Cascaded Vision Language Models

Canshi Wei

arXiv:2405.11301 · cs.CL · May 21, 2024

Open Access

TL;DR

This paper presents CascadeVLM, a new framework that enhances fine-grained image classification by effectively leveraging large vision-language models, significantly improving zero-shot accuracy over existing CLIP-based methods.

Contribution

Introduces CascadeVLM, a novel approach that overcomes CLIP limitations by utilizing granular knowledge from LVLMs for better fine-grained classification.

Findings

01 Achieves 85.6% zero-shot accuracy on the Stanford Cars dataset.

02 LVLMs provide more accurate predictions for challenging images.

03 CascadeVLM outperforms existing models in fine-grained classification.

Abstract

Fine-grained image classification, particularly in zero/few-shot scenarios, presents a significant challenge for vision-language models (VLMs) such as CLIP. These models often struggle to distinguish between semantically similar classes because their pre-training recipe lacks supervision signals for fine-grained categorization. This paper introduces CascadeVLM, a framework that overcomes the constraints of previous CLIP-based methods by leveraging the granular knowledge encapsulated within large vision-language models (LVLMs). Experiments across various fine-grained image datasets demonstrate that CascadeVLM significantly outperforms existing models; on the Stanford Cars dataset in particular, it achieves 85.6% zero-shot accuracy. Performance gain analysis validates that LVLMs produce more accurate…
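The abstract does not spell out how the cascade is wired. As a minimal sketch, assuming a two-stage design in which a CLIP-style model scores every class and an LVLM is consulted only when that scoring looks uncertain, the control flow might resemble the following. The names `cascade_predict` and `lvlm_choose`, the entropy threshold, and the top-k value are all hypothetical illustrations and are not taken from the paper.

```python
import math
from typing import Callable, List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw similarity scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; higher means a less confident distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def cascade_predict(
    clip_scores: Sequence[float],
    class_names: Sequence[str],
    lvlm_choose: Callable[[List[str]], str],
    top_k: int = 3,
    entropy_threshold: float = 0.5,
) -> str:
    """Hypothetical cascade: trust the first stage when it is confident,
    otherwise let a second-stage model pick among the top-k candidates.

    clip_scores   -- assumed image-text similarity scores, one per class
    lvlm_choose   -- assumed stand-in for an LVLM call that returns one
                     of the candidate names it is given
    """
    probs = softmax(clip_scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Early exit: a low-entropy distribution means the first stage is confident.
    if entropy(probs) <= entropy_threshold:
        return class_names[ranked[0]]

    # Otherwise narrow the label space and delegate the final decision.
    candidates = [class_names[i] for i in ranked[:top_k]]
    choice = lvlm_choose(candidates)

    # Fall back to the first-stage winner if the reply is not a valid candidate.
    return choice if choice in candidates else candidates[0]


if __name__ == "__main__":
    # Illustrative labels and scores only; not data from the paper.
    names = ["sedan", "coupe", "convertible", "truck"]
    pick_second = lambda cands: cands[1]  # stub in place of a real LVLM
    print(cascade_predict([9.0, 0.1, 0.0, -1.0], names, pick_second))
    print(cascade_predict([1.0, 0.9, 0.8, -5.0], names, pick_second))
```

Restricting the second stage to a short candidate list keeps its prompt small, and the early exit avoids calling the larger model on images the first stage already handles well. Both choices are assumptions of this sketch rather than confirmed details of CascadeVLM.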

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training