Discriminative-Generative Custom Tokens for Vision-Language Models
Pramuditha Perera, Matthew Trager, Luca Zancato, Alessandro Achille, Stefano Soatto

TL;DR
This paper introduces a method for learning custom tokens in Vision-Language Models that effectively represent new concepts for both discriminative and generative tasks, improving composition and retrieval accuracy.
Contribution
The authors propose a novel approach combining textual inversion and classification losses to learn concept tokens that align with image features in CLIP, enhancing compositionality and retrieval.
Findings
Improved quality of concept composition with natural language.
Enhanced text-to-image retrieval performance.
7% increase in Mean Reciprocal Retrieval on DeepFashion2.
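For readers unfamiliar with the retrieval metric, the snippet below is a minimal sketch of how a reciprocal-rank score is usually computed. It assumes the reported metric follows the standard mean-reciprocal-rank definition (average of 1/rank of the correct item per query); the function name and input layout are illustrative and not taken from the paper.

```python
def mean_reciprocal_rank(ranked_ids, relevant_ids):
    """Average of 1/rank of the relevant item over all queries.

    ranked_ids:   list of rankings, one per query (best match first).
    relevant_ids: the single correct item for each query.
    A query whose relevant item is missing from its ranking contributes 0.
    """
    if not ranked_ids:
        return 0.0
    total = 0.0
    for ranking, target in zip(ranked_ids, relevant_ids):
        for position, item in enumerate(ranking, start=1):
            if item == target:
                total += 1.0 / position
                break
    return total / len(ranked_ids)
```

With two queries whose correct images are ranked first and second, the score is (1 + 1/2) / 2 = 0.75.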
Abstract
This paper explores the possibility of learning custom tokens for representing new concepts in Vision-Language Models (VLMs). Our aim is to learn tokens that can be effective for both discriminative and generative tasks while composing well with words to form new input queries. The targeted concept is specified in terms of a small set of images and a parent concept described using text. We operate on CLIP text features and propose to use a combination of a textual inversion loss and a classification loss to ensure that text features of the learned token are aligned with image features of the concept in the CLIP embedding space. We restrict the learned token to a low-dimensional subspace spanned by tokens for attributes that are appropriate for the given super-class. These modifications improve the quality of compositions of the learned token with natural language for generating new…
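The two ideas in the abstract, a token confined to an attribute subspace and a training objective that mixes an alignment term with a classification term, can be sketched roughly as follows. This is an illustrative NumPy sketch under stated assumptions, not the authors' implementation: the abstract does not give the exact form of either loss, so a cosine-alignment term stands in for the textual inversion loss, and the names `subspace_token`, `combined_loss`, `lam`, and `temp` are hypothetical. Text features are assumed to come from CLIP's text encoder, which is not modeled here.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def subspace_token(attr_embeddings, coeffs):
    """Build the custom token as a linear combination of attribute tokens.

    attr_embeddings: (k, d) embeddings of k attribute words (assumed given).
    coeffs:          (k,) learnable weights; only these would be trained.
    """
    return coeffs @ attr_embeddings


def combined_loss(text_feat, concept_imgs, class_text_feats, target,
                  lam=1.0, temp=0.07):
    """Alignment term plus classification term (both are stand-ins).

    text_feat:        (d,) text feature of a prompt containing the token.
    concept_imgs:     (n, d) image features of the concept's example images.
    class_text_feats: (c, d) text features of competing class prompts;
                      row `target` is replaced by text_feat.
    """
    t = l2_normalize(text_feat)
    imgs = l2_normalize(concept_imgs)

    # Alignment: pull the text feature toward the concept's image features.
    align = 1.0 - float(np.mean(imgs @ t))

    # Classification: each concept image should select the custom prompt.
    texts = l2_normalize(np.array(class_text_feats, dtype=float))
    texts[target] = t
    logits = imgs @ texts.T / temp
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    cls = float(-np.mean(log_probs[:, target]))

    return align + lam * cls
```

In a real training loop, `coeffs` would be updated by gradient descent through the frozen text encoder. Optimizing only `k` coefficients rather than a free `d`-dimensional vector is what keeps the token in the span of the attribute words.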
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training · Sparse Evolutionary Training
