Efficient Vocabulary-Free Fine-Grained Visual Recognition in the Age of Multimodal LLMs

Hari Chandana Kuchibhotla; Sai Srinivas Kancheti; Abbavaram Gowtham Reddy; Vineeth N Balasubramanian

arXiv:2505.01064·cs.CV·May 5, 2025

Open Access

TL;DR

This paper introduces NeaR, a method that fine-tunes CLIP models for vocabulary-free fine-grained visual recognition by leveraging labels generated from multimodal large language models, enabling efficient recognition without prior labels.

Contribution

The paper proposes NeaR, a novel label refinement approach that constructs weakly supervised datasets from unlabeled data using MLLMs, addressing the challenge of vocabulary-free FGVR.
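
As a rough illustration of what "constructing a weakly supervised dataset from unlabeled data using MLLMs" could look like in practice, the sketch below pairs each unlabeled image with one MLLM-generated label and collects the resulting label vocabulary. It is not NeaR's label refinement procedure, which this page does not describe. The helper names (`normalize_label`, `build_weak_dataset`, `build_vocabulary`), the `fake_mllm` stand-in, and the image IDs are all hypothetical and exist only for this example.

```python
from collections import Counter
from typing import Callable, Iterable

def normalize_label(raw: str) -> str:
    """Lowercase, drop a trailing period, and collapse whitespace so that
    near-duplicate MLLM outputs map to the same label string."""
    return " ".join(raw.lower().strip().strip(".").split())

def build_weak_dataset(
    image_ids: Iterable[str],
    query_mllm: Callable[[str], str],
) -> list[tuple[str, str]]:
    """Pair each unlabeled image with one MLLM-generated (possibly noisy) label."""
    return [(image_id, normalize_label(query_mllm(image_id))) for image_id in image_ids]

def build_vocabulary(pairs: list[tuple[str, str]], min_count: int = 1) -> list[str]:
    """Collect the distinct labels seen at least `min_count` times."""
    counts = Counter(label for _, label in pairs)
    return sorted(label for label, count in counts.items() if count >= min_count)

# Hypothetical stand-in for a real MLLM call (assumption: one free-text label per image).
_FAKE_ANSWERS = {
    "img0": "Northern Cardinal.",
    "img1": "northern  cardinal",
    "img2": "Blue Jay",
}

def fake_mllm(image_id: str) -> str:
    return _FAKE_ANSWERS[image_id]

pairs = build_weak_dataset(["img0", "img1", "img2"], fake_mllm)
vocab = build_vocabulary(pairs)
```

The MLLM is queried once per training image here. A model such as CLIP could then be fine-tuned on `pairs` and used at test time against `vocab`, but that step is outside this sketch.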

Findings

01 NeaR effectively handles noisy labels from MLLMs.

02 NeaR establishes new benchmarks for vocabulary-free FGVR.

03 The approach reduces reliance on large annotated datasets.

Abstract

Fine-grained Visual Recognition (FGVR) involves distinguishing between visually similar categories, which is inherently challenging due to subtle inter-class differences and the need for large, expert-annotated datasets. In domains like medical imaging, such curated datasets are unavailable due to issues like privacy concerns and high annotation costs. In such scenarios lacking labeled data, an FGVR model cannot rely on a predefined set of training labels, and hence has an unconstrained output space for predictions. We refer to this task as Vocabulary-Free FGVR (VF-FGVR), where a model must predict labels from an unconstrained output space without prior label information. While recent Multimodal Large Language Models (MLLMs) show potential for VF-FGVR, querying these models for each test input is impractical because of high costs and prohibitive inference times. To address these…

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications

Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training