Enhancing Fine-Grained Image Classifications via Cascaded Vision Language Models
Canshi Wei

TL;DR
This paper presents CascadeVLM, a new framework that enhances fine-grained image classification by effectively leveraging large vision-language models, significantly improving zero-shot accuracy over existing CLIP-based methods.
Contribution
Introduces CascadeVLM, a framework that overcomes the limitations of CLIP-based methods by drawing on the granular knowledge of large vision-language models (LVLMs) for fine-grained classification.
Findings
Achieves 85.6% zero-shot accuracy on the Stanford Cars dataset.
LVLMs provide more accurate predictions for challenging images.
CascadeVLM significantly outperforms existing models across various fine-grained image datasets.
Abstract
Fine-grained image classification, particularly in zero/few-shot scenarios, presents a significant challenge for vision-language models (VLMs) such as CLIP. These models often struggle to distinguish between semantically similar classes because their pre-training recipe lacks supervision signals for fine-grained categorization. This paper introduces CascadeVLM, a framework that overcomes the constraints of previous CLIP-based methods by leveraging the granular knowledge encapsulated within large vision-language models (LVLMs). Experiments across various fine-grained image datasets demonstrate that CascadeVLM significantly outperforms existing models, notably achieving 85.6% zero-shot accuracy on the Stanford Cars dataset. Performance gain analysis validates that LVLMs produce more accurate…
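The abstract does not spell out how the cascade operates. As a rough illustration only, one plausible reading is a two-stage pipeline: a CLIP-style model scores every class, and an LVLM is consulted only when those scores look uncertain, choosing among the top few candidates. Everything in the sketch below is an assumption made for illustration, not the paper's confirmed method: the entropy-based uncertainty test, the `top_k` and `entropy_threshold` parameters, and the `ask_lvlm` callback (a hypothetical stand-in for a real LVLM query) are all invented names and choices.

```python
import math
from typing import Callable, Sequence


def softmax(scores: Sequence[float]) -> list[float]:
    """Convert raw similarity scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; higher means a less certain first stage."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cascade_classify(
    scores: Sequence[float],
    class_names: Sequence[str],
    ask_lvlm: Callable[[list[str]], str],
    top_k: int = 3,
    entropy_threshold: float = 0.5,
) -> str:
    """Hypothetical two-stage cascade (illustrative, not the paper's code).

    `scores` stands in for first-stage (CLIP-style) image-text similarities.
    `ask_lvlm` stands in for an LVLM that picks one name from a shortlist.
    """
    probs = softmax(scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Confident first stage: return its top prediction without calling the LVLM.
    if entropy(probs) < entropy_threshold:
        return class_names[ranked[0]]
    # Uncertain first stage: let the LVLM pick among the top-k candidates.
    candidates = [class_names[i] for i in ranked[:top_k]]
    answer = ask_lvlm(candidates)
    # Fall back to the first-stage prediction if the LVLM answers off-list.
    return answer if answer in candidates else candidates[0]
```

In this sketch the shortlist keeps the LVLM's choice within a handful of similar classes, and the threshold avoids an LVLM call on images the first stage already handles confidently. Both values would need tuning per dataset.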
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training
