LiteEmbed: Adapting CLIP to Rare Classes
Aishwarya Agarwal, Srikrishna Karanam, Vineet Gandhi

TL;DR
LiteEmbed is a lightweight method that adapts CLIP to recognize rare or unseen classes by optimizing text embeddings without retraining the entire model, improving performance across various vision tasks.
Contribution
It introduces a PCA-based subspace-guided optimization for CLIP's text embeddings, enabling effective few-shot personalization without retraining encoders.
Findings
Significant performance improvements over prior methods.
Effective across classification, retrieval, segmentation, and detection tasks.
Seamless plug-and-play integration with CLIP.
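A minimal sketch of what plug-and-play substitution could look like, assuming frozen CLIP image and text features are already available as NumPy arrays. The random arrays below are toy stand-ins for those features, and `classify` and `add_class` are hypothetical helper names, not part of CLIP or LiteEmbed. Adding a class is just appending one optimized text embedding; the encoders are never touched.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def classify(image_emb, text_emb):
    """Return, for each image row, the index of the most similar text embedding."""
    sims = l2_normalize(image_emb) @ l2_normalize(text_emb).T
    return sims.argmax(axis=1)

def add_class(text_emb, new_emb):
    """Append one class embedding as a new row; existing rows are unchanged."""
    return np.vstack([text_emb, new_emb[None, :]])

# Toy stand-ins (assumption): 3 existing class embeddings, 1 new rare-class embedding.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(3, 8))
rare_emb = rng.normal(size=8)

# A toy "image" feature that lies very close to the rare-class embedding.
image_emb = rare_emb[None, :] + 0.01 * rng.normal(size=(1, 8))

extended = add_class(text_emb, rare_emb)
pred = classify(image_emb, extended)
```

The same extended matrix could in principle feed any downstream head that scores image features against text features, which is the sense of "plug-and-play" here.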
Abstract
Large-scale vision-language models such as CLIP achieve strong zero-shot recognition but struggle with classes that are rarely seen during pretraining, including newly emerging entities and culturally specific categories. We introduce LiteEmbed, a lightweight framework for few-shot personalization of CLIP that enables new classes to be added without retraining its encoders. LiteEmbed performs subspace-guided optimization of text embeddings within CLIP's vocabulary, leveraging a PCA-based decomposition that disentangles coarse semantic directions from fine-grained variations. Two complementary objectives, coarse alignment and fine separation, jointly preserve global semantic consistency while enhancing discriminability among visually similar classes. Once optimized, the embeddings are plug-and-play, seamlessly substituting CLIP's original text features across classification, retrieval,…
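The PCA-based decomposition can be illustrated with a short NumPy sketch. It assumes that the "coarse" part of an embedding is its projection onto the top-k principal directions of a set of vocabulary text embeddings and that the "fine" part is the residual; that assignment, the value of k, and the function names are illustrative assumptions rather than the authors' implementation. The coarse-alignment and fine-separation objectives are not specified in the text above, so they are not shown.

```python
import numpy as np

def pca_basis(vocab_emb, k):
    """Mean and top-k principal directions (orthonormal rows) of vocab_emb."""
    mean = vocab_emb.mean(axis=0)
    # SVD of the centered data; rows of vt are principal directions,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(vocab_emb - mean, full_matrices=False)
    return mean, vt[:k]

def split_coarse_fine(emb, mean, basis):
    """Split a centered embedding into its in-subspace part and the residual."""
    centered = emb - mean
    coarse = centered @ basis.T @ basis   # projection onto the top-k subspace
    fine = centered - coarse              # orthogonal remainder
    return coarse, fine

# Toy stand-in (assumption) for CLIP vocabulary text embeddings: 200 tokens, 16 dims.
rng = np.random.default_rng(0)
vocab_emb = rng.normal(size=(200, 16))
mean, basis = pca_basis(vocab_emb, k=4)

# Decompose one candidate class embedding.
emb = rng.normal(size=16)
coarse, fine = split_coarse_fine(emb, mean, basis)
```

Because the two parts are orthogonal and sum back to the centered embedding, an optimizer could constrain or penalize each part separately without one update undoing the other.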
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Topic Modeling
