Vocabulary-free Fine-grained Visual Recognition via Enriched Contextually Grounded Vision-Language Model

Dmitry Demidov; Zaigham Zaheer; Omkar Thawakar; Salman Khan; Fahad Shahbaz Khan

arXiv:2507.23070 · cs.CV · December 23, 2025

Open Access

TL;DR

This paper introduces E-FineR, a training-free, vocabulary-free approach that leverages vision-language models for fine-grained visual recognition, achieving state-of-the-art results and high interpretability without requiring class labels or extensive annotations.

Contribution

The proposed E-FineR method is a training-free, vocabulary-free framework that enhances fine-grained recognition using vision-language models, enabling scalable, interpretable, and adaptable classification in open-set scenarios.

Findings

01 Achieves state-of-the-art results in fine-grained recognition

02 Performs competitively in zero-shot and few-shot classification

03 Eliminates need for training data and human annotations

Abstract

Fine-grained image classification, the task of distinguishing between visually similar subcategories within a broader category (e.g., bird species, car models, flower types), is a challenging computer vision problem. Traditional approaches rely heavily on fixed vocabularies and closed-set classification paradigms, which limits their scalability and adaptability in real-world settings where novel classes frequently emerge. Recent research has demonstrated that combining large language models (LLMs) with vision-language models (VLMs) makes open-set recognition possible without predefined class labels. However, existing methods make limited use of LLMs at the classification phase and rely heavily on class names guessed by an LLM, without thorough analysis or refinement. To address these bottlenecks, we propose our training-free…
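The general idea of vocabulary-free recognition described in the abstract, matching an image against class names proposed by an LLM rather than a fixed label set, can be illustrated with a minimal sketch. This is not the E-FineR method itself, whose details are not given here; it only shows the common VLM-style step of scoring candidate names by cosine similarity in a shared embedding space. The candidate names, the embedding values, and the helper functions `cosine_scores` and `classify` are all hypothetical stand-ins; in practice the embeddings would come from a vision-language model's image and text encoders.

```python
import numpy as np


def cosine_scores(image_emb, text_embs):
    """Cosine similarity between one image embedding and each candidate-name embedding."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return text_embs @ image_emb


def classify(image_emb, candidate_names, text_embs):
    """Pick the candidate name whose embedding is closest to the image embedding."""
    scores = cosine_scores(image_emb, text_embs)
    best = int(np.argmax(scores))
    return candidate_names[best], scores


# Hypothetical candidate names, standing in for names guessed by an LLM.
candidate_names = ["Blue Jay", "Steller's Jay", "Indigo Bunting"]

# Toy 3-d embeddings (assumed values); a real VLM would produce these vectors.
text_embs = np.array([
    [0.9, 0.1, 0.0],
    [0.6, 0.6, 0.1],
    [0.1, 0.2, 0.9],
])
image_emb = np.array([0.8, 0.2, 0.1])

label, scores = classify(image_emb, candidate_names, text_embs)
print(label)
```

Because the candidate list is produced at inference time rather than fixed in advance, the same scoring step applies to classes that were never enumerated beforehand. The quality of the guessed names then becomes the main bottleneck, which is the limitation the abstract points to.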

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques