Efficient and Long-Tailed Generalization for Pre-trained Vision-Language Model
Jiang-Xin Shi, Chi Zhang, Tong Wei, Yu-Feng Li

TL;DR
This paper introduces Candle, a framework that enhances pre-trained vision-language models like CLIP for long-tailed and zero-shot scenarios by using prototype-based methods, loss adjustments, and virtual prototypes, achieving state-of-the-art results efficiently.
Contribution
Candle is a novel framework that improves the long-tailed and zero-shot generalization of CLIP by combining a compensating logit-adjusted loss, cross-modal attention, and virtual prototypes for new classes.
Findings
Achieves state-of-the-art performance on 11 datasets.
Reduces training time significantly.
Effectively handles long-tailed and zero-shot tasks.
Abstract
Pre-trained vision-language models like CLIP have shown powerful zero-shot inference ability via image-text matching and prove to be strong few-shot learners in various downstream tasks. However, in real-world scenarios, adapting CLIP to downstream tasks may encounter the following challenges: 1) data may exhibit long-tailed distributions and might not have abundant samples for all the classes; 2) there might be emerging tasks with new classes that contain no samples at all. To overcome them, we propose a novel framework, termed Candle, to achieve efficient and long-tailed generalization. During the training process, we propose a compensating logit-adjusted loss to encourage large margins of prototypes and alleviate imbalance both within the base classes and between the base and new classes. For efficient adaptation, we treat the CLIP model as a black box and leverage…
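To make the idea of a logit-adjusted loss over class prototypes more concrete, here is a minimal sketch in Python with NumPy. It shows the standard logit-adjusted cross-entropy (adding the log of the class prior to the logits before the softmax), not the paper's compensating variant, whose exact form is not given in the text above. The function names, the `scale` value of 100, and the `tau` parameter are assumptions made for illustration only.

```python
import numpy as np

def prototype_logits(features, prototypes, scale=100.0):
    """Cosine-similarity logits between features and class prototypes.

    `scale` is an assumed temperature-like factor, not a value from the paper.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return scale * f @ p.T

def logit_adjusted_loss(logits, labels, class_counts, tau=1.0):
    """Standard logit-adjusted cross-entropy (illustrative, not Candle's loss).

    Adds tau * log(class prior) to each logit so that rare classes need a
    larger raw logit to be predicted during training, which encourages
    larger margins for tail classes.
    """
    prior = class_counts / class_counts.sum()
    adjusted = logits + tau * np.log(prior)
    # Numerically stable log-softmax.
    adjusted = adjusted - adjusted.max(axis=1, keepdims=True)
    log_probs = adjusted - np.log(np.exp(adjusted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Toy usage: two classes, 90 head samples and 10 tail samples.
feats = np.array([[1.0, 0.0]])
protos = np.array([[2.0, 0.0], [0.0, 3.0]])
logits = prototype_logits(feats, protos)
loss = logit_adjusted_loss(logits, np.array([0]), np.array([90.0, 10.0]))
```

With balanced class counts the adjustment is the same constant for every class and cancels in the softmax, so the loss reduces to ordinary cross-entropy. With imbalanced counts, a tail-class sample incurs a larger loss than it would under plain cross-entropy for the same raw logits.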
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Digital Imaging for Blood Diseases
Methods: Softmax · Attention Is All You Need · Balanced Selection · Contrastive Language-Image Pre-training
