UniFGVC: Universal Training-Free Few-Shot Fine-Grained Vision Classification via Attribute-Aware Multimodal Retrieval
Hongyu Guo, Xiangzhao Hao, Jiarui Guo, Haiyun Guo, Jinqiao Wang, Tat-Seng Chua

TL;DR
UniFGVC is a training-free framework that recasts few-shot fine-grained visual classification as multimodal retrieval. It uses multimodal large language models to generate attribute-aware descriptions that separate closely related classes without any fine-tuning.
Contribution
The paper proposes a training-free approach to few-shot FGVC that combines multimodal retrieval with attribute-aware captioning by multimodal large language models.
Findings
Outperforms prior CLIP-based few-shot methods on 12 benchmarks.
Demonstrates broad compatibility with various multimodal large language models.
Achieves reliable generalization across diverse few-shot FGVC scenarios.
Abstract
Few-shot fine-grained visual classification (FGVC) aims to leverage limited data to enable models to discriminate between subtly distinct categories. Recent works mostly fine-tune pre-trained vision-language models to achieve performance gains, yet suffer from overfitting and weak generalization. To address this, we introduce UniFGVC, a universal training-free framework that reformulates few-shot FGVC as multimodal retrieval. First, we propose the Category-Discriminative Visual Captioner (CDV-Captioner), which exploits the open-world knowledge of multimodal large language models (MLLMs) to generate a structured text description capturing the fine-grained attribute features that distinguish closely related classes. CDV-Captioner uses chain-of-thought prompting and visually similar reference images to reduce hallucination and make the generated captions more discriminative. Using it we can…
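The retrieval reformulation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract is truncated before it describes how image and caption features are combined, so the fusion rule here (concatenating normalized image and text vectors), the function names, and the toy embeddings are all assumptions made for illustration. In practice the vectors would come from pre-trained encoders and the captions from an MLLM-based captioner.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

def fuse(image_vec, text_vec):
    """Hypothetical fusion: concatenate the normalized image and caption
    embeddings, then renormalize. The paper's actual fusion may differ."""
    return l2_normalize(l2_normalize(image_vec) + l2_normalize(text_vec))

def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))

def classify_by_retrieval(query, gallery):
    """Return the label of the gallery entry most similar to the query.

    query:   (image_vec, text_vec)
    gallery: list of (image_vec, text_vec, label) built from the few-shot
             support set; no parameters are trained.
    """
    q = fuse(*query)
    best_label, best_score = None, -math.inf
    for image_vec, text_vec, label in gallery:
        score = cosine(q, fuse(image_vec, text_vec))
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score

# Toy data: both classes look identical (same image vector), so only the
# caption embedding separates them.
gallery = [
    ([1.0, 0.0], [0.9, 0.1], "class_a"),
    ([1.0, 0.0], [0.1, 0.9], "class_b"),
]
query = ([1.0, 0.0], [0.2, 0.8])
label, score = classify_by_retrieval(query, gallery)
print(label)
```

In the toy data the two classes share the same image vector, which mimics the fine-grained case where visual features alone cannot separate categories and the attribute-aware caption supplies the distinguishing signal.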
