Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion
Philipp Allgeuer, Kyra Ahrens, Stefan Wermter

TL;DR
NOVIC is a real-time, unconstrained open vocabulary image classifier that uses an autoregressive transformer to generate class labels as free-form text, with no predefined class list. It is trained on text and transfers zero-shot to images through the CLIP embedding space.
Contribution
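It introduces an "object decoder" trained on a 92M-target dataset of templated object noun sets and LLM-generated captions to output the object noun in question. This effectively inverts the CLIP text encoder, so labels can be generated directly from image features without prior class knowledge.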
Findings
Achieves up to 87.5% accuracy on standard benchmarks.
Operates without predefined class labels or context.
Supports zero-shot transfer from text to image classification.
Abstract
We introduce NOVIC, an innovative real-time uNconstrained Open Vocabulary Image Classifier that uses an autoregressive transformer to generatively output classification labels as language. Leveraging the extensive knowledge of CLIP models, NOVIC harnesses the embedding space to enable zero-shot transfer from pure text to images. Traditional CLIP models, despite their ability for open vocabulary classification, require an exhaustive prompt of potential class labels, restricting their application to images of known content or context. To address this, we propose an "object decoder" model that is trained on a large-scale 92M-target dataset of templated object noun sets and LLM-generated captions to always output the object noun in question. This effectively inverts the CLIP text encoder and allows textual object labels from essentially the entire English language to be generated directly…
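To make the limitation described in the abstract concrete, the following is a minimal sketch of conventional prompt-based zero-shot classification, in which the predicted label is whichever candidate text embedding lies closest to the image embedding. It is not the NOVIC method or code. The 3-dimensional vectors, the label set, and the helper names `cosine` and `classify_with_candidates` are all made up for illustration; real CLIP embeddings come from the CLIP image and text encoders and have hundreds of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify_with_candidates(image_emb, text_embs):
    """Pick the candidate label whose text embedding is most similar to the image embedding."""
    return max(text_embs, key=lambda label: cosine(image_emb, text_embs[label]))

# Hypothetical toy vectors standing in for CLIP text embeddings of label prompts.
TEXT_EMBS = {
    "horse": [0.9, 0.1, 0.1],
    "tiger": [0.1, 0.9, 0.2],
    "zebra": [0.7, 0.6, 0.1],
}

# Hypothetical toy vector standing in for the CLIP embedding of a zebra photo.
zebra_image = [0.68, 0.62, 0.12]

# When the true class is in the candidate list, the nearest label is correct.
full = classify_with_candidates(zebra_image, TEXT_EMBS)

# When the true class is missing, the classifier can only return a listed label.
without_zebra = {k: v for k, v in TEXT_EMBS.items() if k != "zebra"}
restricted = classify_with_candidates(zebra_image, without_zebra)

print(full, restricted)
```

In this toy setup the second call can never answer "zebra", because the answer is always drawn from the supplied list. A generative decoder that maps an embedding back to a label, as the abstract describes, removes the need for such a list.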
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Linear Layer · Softmax · Layer Normalization · Residual Connection · Attention Is All You Need · Dense Connections · Multi-Head Attention · Vision Transformer · Contrastive Language-Image Pre-training
