Descriminative-Generative Custom Tokens for Vision-Language Models

Pramuditha Perera; Matthew Trager; Luca Zancato; Alessandro Achille; Stefano Soatto

arXiv:2502.12095·cs.CV·February 18, 2025

Open Access

TL;DR

This paper introduces a method for learning custom tokens in Vision-Language Models that effectively represent new concepts for both discriminative and generative tasks, improving composition and retrieval accuracy.

Contribution

The authors propose a novel approach combining textual inversion and classification losses to learn concept tokens that align with image features in CLIP, enhancing compositionality and retrieval.

Findings

01. Improved quality of concept composition with natural language.

02. Enhanced text-to-image retrieval performance.

03. A 7% increase in Mean Reciprocal Retrieval (MRR) on DeepFashion2.
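
For readers unfamiliar with the retrieval metric in the last finding, the following is a minimal sketch of how a mean-reciprocal-rank style score is commonly computed. It assumes that "Mean Reciprocal Retrieval" refers to the standard mean reciprocal rank; the function name `mean_reciprocal_rank` and the toy item identifiers are illustrative and do not come from the paper.

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """Average of 1/rank of the first relevant item, over all queries.

    ranked_results: one ranked list of item ids per query (best match first).
    relevant: one set of relevant item ids per query.
    A query with no relevant item in its ranking contributes 0.
    """
    if not ranked_results or len(ranked_results) != len(relevant):
        raise ValueError("need one non-empty, matching entry per query")

    total = 0.0
    for ranking, relevant_ids in zip(ranked_results, relevant):
        for position, item in enumerate(ranking, start=1):
            if item in relevant_ids:
                total += 1.0 / position  # only the first hit counts
                break
    return total / len(ranked_results)


# Toy example (hypothetical data): first hits at rank 1, rank 3, and no hit.
rankings = [["a", "b", "c"], ["x", "y", "z"], ["p", "q"]]
targets = [{"a"}, {"z"}, {"missing"}]
score = mean_reciprocal_rank(rankings, targets)  # (1 + 1/3 + 0) / 3
```

Because only the first relevant hit per query counts, the score rewards placing a correct image near the top of the ranking rather than retrieving many correct images.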

Abstract

This paper explores the possibility of learning custom tokens for representing new concepts in Vision-Language Models (VLMs). Our aim is to learn tokens that can be effective for both discriminative and generative tasks while composing well with words to form new input queries. The targeted concept is specified in terms of a small set of images and a parent concept described using text. We operate on CLIP text features and propose to use a combination of a textual inversion loss and a classification loss to ensure that text features of the learned token are aligned with image features of the concept in the CLIP embedding space. We restrict the learned token to a low-dimensional subspace spanned by tokens for attributes that are appropriate for the given super-class. These modifications improve the quality of compositions of the learned token with natural language for generating new…
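
The two ingredients named in the abstract, a token restricted to the span of attribute tokens and a loss that adds a classification term to a textual inversion term, can be illustrated with a toy sketch. This is not the authors' implementation: it treats the learned token directly as a text feature (a real pipeline would pass it through the CLIP text encoder), it leaves the textual inversion loss as a caller-supplied placeholder `inversion_loss_fn`, and the names `combine`, `classification_loss`, `total_loss`, the `negatives` argument, the temperature, and the weighting `lam` are all assumptions made for illustration.

```python
import math


def combine(weights, basis):
    """Token constrained to a subspace: token = sum_i weights[i] * basis[i].

    basis holds the embeddings of the attribute tokens (assumed given);
    only the low-dimensional weights would be learned.
    """
    dim = len(basis[0])
    return [sum(w * vec[d] for w, vec in zip(weights, basis)) for d in range(dim)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def classification_loss(text_feature, image_feature, negatives, temperature=0.07):
    """Softmax cross-entropy over cosine similarities (CLIP-style).

    The concept image should match the custom-token text feature (index 0)
    rather than any of the negative text features.
    """
    logits = [cosine(image_feature, text_feature) / temperature]
    logits += [cosine(image_feature, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_z - logits[0]


def total_loss(weights, basis, image_feature, negatives, inversion_loss_fn, lam=1.0):
    """Placeholder inversion loss plus weighted classification loss."""
    token = combine(weights, basis)
    return inversion_loss_fn(token) + lam * classification_loss(
        token, image_feature, negatives
    )


# Toy usage with hypothetical 3-d embeddings and a dummy inversion loss.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
token = combine([0.5, 2.0], basis)  # stays inside span(basis)
loss = total_loss([1.0, 0.0], basis, [1.0, 0.0, 0.0], [[0.0, 1.0, 0.0]],
                  inversion_loss_fn=lambda t: 0.0)
```

Parameterizing the token by a few weights over fixed attribute embeddings keeps the optimization low-dimensional, which is one plausible reading of why the restriction helps the token compose with ordinary words.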

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training · Sparse Evolutionary Training