Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion

Philipp Allgeuer; Kyra Ahrens; Stefan Wermter

arXiv:2407.11211 · cs.CV · April 15, 2025

TL;DR

NOVIC is a real-time, unconstrained open vocabulary image classifier. It uses an autoregressive transformer to generate classification labels as text directly from images, without a predefined class list, by exploiting zero-shot transfer from text to images in the CLIP embedding space.

Contribution

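The paper introduces an "object decoder" trained on a 92M-target dataset of templated object noun sets and LLM-generated captions. The decoder learns to output the object noun behind each text embedding, which effectively inverts the CLIP text encoder and allows labels to be generated directly from image features without prior class knowledge.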
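As a rough illustration of the text-only training data described here, the sketch below builds (caption, target noun) pairs from templates. The template strings and the `build_pairs` helper are hypothetical and are not taken from the paper. In a full pipeline, each caption would presumably be embedded with the CLIP text encoder and the decoder trained to emit the target noun, so that an image embedding can stand in for the text embedding at inference time.

```python
# Hypothetical caption templates; the paper's actual templates are not shown here.
TEMPLATES = [
    "a photo of a {}",
    "a picture of the {}",
    "an image showing a {}",
]


def build_pairs(nouns, templates=TEMPLATES):
    """Pair every templated caption with the object noun it was built from."""
    return [(template.format(noun), noun) for noun in nouns for template in templates]


pairs = build_pairs(["zebra", "teapot"])
for caption, target in pairs:
    print(f"{caption!r} -> {target!r}")
```

Every caption maps back to exactly one noun, which is the property the decoder needs in order to learn to "always output the object noun in question".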

Findings

1. Achieves up to 87.5% accuracy on standard benchmarks.

2. Operates without predefined class labels or context.

3. Supports zero-shot transfer from text to image classification.

Abstract

We introduce NOVIC, an innovative real-time uNconstrained Open Vocabulary Image Classifier that uses an autoregressive transformer to generatively output classification labels as language. Leveraging the extensive knowledge of CLIP models, NOVIC harnesses the embedding space to enable zero-shot transfer from pure text to images. Traditional CLIP models, despite their ability for open vocabulary classification, require an exhaustive prompt of potential class labels, restricting their application to images of known content or context. To address this, we propose an "object decoder" model that is trained on a large-scale 92M-target dataset of templated object noun sets and LLM-generated captions to always output the object noun in question. This effectively inverts the CLIP text encoder and allows textual object labels from essentially the entire English language to be generated directly…
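The limitation of prompt-based CLIP classification mentioned in the abstract can be sketched as follows. This is a minimal toy example, assuming made-up 3-D vectors in place of real CLIP embeddings; the `classify_closed_vocab` helper is hypothetical. It shows that a classifier which ranks a fixed list of candidate labels can only ever answer with a label from that list.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_closed_vocab(image_emb, label_embs):
    """Return the candidate label whose text embedding best matches the image."""
    return max(label_embs, key=lambda label: cosine(image_emb, label_embs[label]))


# Toy stand-ins for CLIP text embeddings of the candidate labels (made up).
label_embs = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.1, 0.9, 0.0],
}

# Toy stand-in for the embedding of an image whose true content ("zebra")
# is missing from the candidate list.
image_of_zebra = [0.2, 0.1, 0.95]

prediction = classify_closed_vocab(image_of_zebra, label_embs)
print(prediction)  # always one of the candidate labels
```

Because "zebra" is not among the candidates, the prediction is necessarily wrong here. A generative decoder avoids this failure mode by producing the label text itself rather than ranking a fixed list.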

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Linear Layer · Softmax · Layer Normalization · Residual Connection · Attention Is All You Need · Dense Connections · Multi-Head Attention · Vision Transformer · Contrastive Language-Image Pre-training