Vocabulary-free Fine-grained Visual Recognition via Enriched Contextually Grounded Vision-Language Model
Dmitry Demidov, Zaigham Zaheer, Omkar Thawakar, Salman Khan, Fahad Shahbaz Khan

TL;DR
This paper introduces E-FineR, a training-free, vocabulary-free approach that leverages vision-language models for fine-grained visual recognition, achieving state-of-the-art results and high interpretability without requiring class labels or extensive annotations.
Contribution
The proposed E-FineR method is a training-free, vocabulary-free framework that enhances fine-grained recognition using vision-language models, enabling scalable, interpretable, and adaptable classification in open-set scenarios.
Findings
Achieves state-of-the-art results in fine-grained recognition
Performs competitively in zero-shot and few-shot classification
Eliminates the need for training data and human annotations
Abstract
Fine-grained image classification, the task of distinguishing between visually similar subcategories within a broader category (e.g., bird species, car models, flower types), is a challenging computer vision problem. Traditional approaches rely heavily on fixed vocabularies and closed-set classification paradigms, limiting their scalability and adaptability in real-world settings where novel classes frequently emerge. Recent research has demonstrated that combining large language models (LLMs) with vision-language models (VLMs) makes open-set recognition possible without the need for predefined class labels. However, the existing methods are often limited in harnessing the power of LLMs at the classification phase, and also rely heavily on the guessed class names provided by an LLM without thorough analysis and refinement. To address these bottlenecks, we propose our training-free…
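The vocabulary-free setting described in the abstract can be illustrated with a minimal sketch. It is not the E-FineR pipeline, which the truncated abstract does not detail. It shows only the generic idea of assigning an image to the closest of several LLM-proposed class names in a shared embedding space. The candidate names, the three-dimensional vectors, and the `classify` helper are all hypothetical; in practice the vectors would come from a VLM's image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(image_embedding, candidate_embeddings):
    """Return (name, score) of the candidate text embedding closest to the image."""
    scores = {
        name: cosine(image_embedding, emb)
        for name, emb in candidate_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical class names, as an LLM might propose them without a fixed vocabulary.
# The toy vectors stand in for VLM text embeddings (assumption, not real model output).
candidates = {
    "painted bunting": [0.9, 0.1, 0.2],
    "indigo bunting": [0.2, 0.9, 0.1],
    "lazuli bunting": [0.1, 0.3, 0.9],
}

# Toy vector standing in for the VLM image embedding of an unlabeled photo.
image = [0.15, 0.85, 0.2]

label, score = classify(image, candidates)
print(label, round(score, 3))
```

Because the candidate list is supplied at inference time rather than fixed during training, new names can be added or refined without retraining, which is the property the abstract highlights for open-set recognition.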
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
