Vocabulary-free Image Classification
Alessandro Conti, Enrico Fini, Massimiliano Mancini, Paolo Rota, Yiming Wang, Elisa Ricci

TL;DR
This paper introduces Vocabulary-free Image Classification (VIC), a new task that classifies images into an open-ended semantic space without predefined categories, using external databases and vision-language models.
Contribution
It formalizes VIC, proposes CaSED, a training-free method leveraging external databases and vision-language models, and demonstrates its effectiveness over existing frameworks.
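To make the idea of a training-free, database-backed classifier concrete, here is a minimal sketch of one plausible reading of the pipeline: retrieve the captions closest to the image from an external database, pull candidate class names out of those captions, and score each candidate against the image. This is not the authors' implementation. The toy database, the three-dimensional embeddings, the `WORD_EMBEDDINGS` lookup, and every function name are hypothetical stand-ins for what a real vision-language model and caption database would provide.

```python
import math

# Hypothetical toy database of (caption, caption embedding) pairs.
# A real system would hold embeddings produced by a vision-language model.
DATABASE = [
    ("a tabby cat sleeping on a sofa", [0.9, 0.1, 0.0]),
    ("a cat playing with yarn", [0.8, 0.2, 0.1]),
    ("a dog running in a park", [0.1, 0.9, 0.1]),
    ("a red sports car", [0.0, 0.1, 0.9]),
]

# Hypothetical text embeddings for candidate words (assumed, not from the paper).
WORD_EMBEDDINGS = {
    "cat": [1.0, 0.0, 0.0],
    "sofa": [0.5, 0.0, 0.3],
    "yarn": [0.4, 0.1, 0.2],
    "dog": [0.0, 1.0, 0.0],
    "park": [0.1, 0.6, 0.2],
    "car": [0.0, 0.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_captions(image_emb, k=2):
    """Return the k captions whose embeddings are closest to the image."""
    ranked = sorted(DATABASE, key=lambda item: cosine(image_emb, item[1]), reverse=True)
    return [caption for caption, _ in ranked[:k]]


def extract_candidates(captions):
    """Collect unique words from the captions that have a known embedding."""
    candidates = []
    for caption in captions:
        for word in caption.split():
            if word in WORD_EMBEDDINGS and word not in candidates:
                candidates.append(word)
    return candidates


def classify(image_emb, k=2):
    """Pick the candidate word most similar to the image; no training involved."""
    candidates = extract_candidates(retrieve_captions(image_emb, k))
    return max(candidates, key=lambda w: cosine(image_emb, WORD_EMBEDDINGS[w]))


if __name__ == "__main__":
    print(classify([0.95, 0.05, 0.0]))
```

The point of the sketch is that the candidate label set is built per image from retrieved text rather than fixed in advance, which is what removes the need for a predefined vocabulary. How candidates are actually filtered and scored in CaSED is not specified in this summary.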
Findings
CaSED outperforms other vision-language frameworks on benchmarks.
External databases are an effective way to capture relevant semantic content.
The approach is parameter-efficient, using fewer parameters than the frameworks it is compared against.
Abstract
Recent advances in large vision-language models have revolutionized the image classification paradigm. Despite their impressive zero-shot capabilities, these models assume a pre-defined set of categories, a.k.a. the vocabulary, at test time for composing the textual prompts. However, such an assumption can be impractical when the semantic context is unknown and evolving. We thus formalize a novel task, termed Vocabulary-free Image Classification (VIC), where we aim to assign to an input image a class that resides in an unconstrained language-induced semantic space, without the prerequisite of a known vocabulary. VIC is a challenging task as the semantic space is extremely large, containing millions of concepts, with hard-to-discriminate fine-grained categories. In this work, we first empirically verify that representing this semantic space by means of an external vision-language database is…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
