Thinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs

Dmitry Demidov; Zaigham Zaheer; Zongyan Han; Omkar Thawakar; Rao Anwer

arXiv:2512.18897 · cs.CV · February 27, 2026

Open Access

TL;DR

This paper introduces FiNDR, a reasoning-augmented large multimodal model framework for vocabulary-free fine-grained image recognition, achieving state-of-the-art results without relying on predefined label vocabularies.

Contribution

The paper presents the first reasoning-augmented large multimodal model (LMM) approach to vocabulary-free fine-grained recognition, generating labels automatically and surpassing previous methods.

Findings

01. Achieves an improvement of up to 18.8% over previous approaches.

02. Outperforms zero-shot baselines that use human-curated prompts.

03. Establishes reasoning-augmented LMMs as effective for open-world recognition.

Abstract

Vocabulary-free fine-grained image recognition aims to distinguish visually similar categories within a meta-class without a fixed, human-defined label set. Existing solutions for this problem are limited by either the usage of a large and rigid list of vocabularies or by the dependency on complex pipelines with fragile heuristics where errors propagate across stages. Meanwhile, the ability of recent large multi-modal models (LMMs) equipped with explicit or implicit reasoning to comprehend visual-language data, decompose problems, retrieve latent knowledge, and self-correct suggests a more principled and effective alternative. Building on these capabilities, we propose FiNDR (Fine-grained Name Discovery via Reasoning), the first reasoning-augmented LMM-based framework for vocabulary-free fine-grained recognition. The system operates in three automated steps: (i) a reasoning-enabled LMM…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning