Efficient and Long-Tailed Generalization for Pre-trained Vision-Language Model

Jiang-Xin Shi; Chi Zhang; Tong Wei; Yu-Feng Li

arXiv:2406.12638·cs.CV·June 19, 2024

TL;DR

This paper introduces Candle, a framework that enhances pre-trained vision-language models like CLIP for long-tailed and zero-shot scenarios by using prototype-based methods, loss adjustments, and virtual prototypes, achieving state-of-the-art results efficiently.

Contribution

Candle is a novel framework that improves long-tailed and zero-shot generalization of CLIP by using logit-adjusted loss, cross-modal attention, and virtual prototypes for new classes.
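To make the logit-adjusted loss mentioned above concrete, here is a minimal sketch of the standard logit-adjusted cross-entropy, in which each logit is shifted by the log of its class prior before the softmax. This is the generic formulation, not the paper's compensating variant, whose exact form is not given on this page. The function names, the `tau` parameter, and the toy class counts are assumptions made for illustration.

```python
import math

def class_priors(counts):
    """Empirical class priors from per-class training sample counts."""
    total = sum(counts)
    return [c / total for c in counts]

def logit_adjusted_loss(logits, label, priors, tau=1.0):
    """Cross-entropy on logits shifted by tau * log(prior).

    Generic logit adjustment (an assumption for illustration),
    not the compensating loss proposed in the paper.
    """
    adjusted = [z + tau * math.log(p) for z, p in zip(logits, priors)]
    # Numerically stable log-sum-exp over the adjusted logits.
    m = max(adjusted)
    log_norm = m + math.log(sum(math.exp(a - m) for a in adjusted))
    return log_norm - adjusted[label]

# Hypothetical long-tailed setting: one head class and two rarer classes.
counts = [900, 90, 10]
priors = class_priors(counts)
logits = [2.0, 2.0, 2.0]  # the model is undecided between the classes

head_loss = logit_adjusted_loss(logits, 0, priors)
tail_loss = logit_adjusted_loss(logits, 2, priors)
plain_loss = logit_adjusted_loss(logits, 2, priors, tau=0.0)  # ordinary cross-entropy
```

With identical raw logits, the adjusted loss is larger for the tail class than for the head class, so training pushes tail-class logits up by a larger margin. Setting `tau=0.0` recovers ordinary cross-entropy.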

Findings

1. Achieves state-of-the-art performance on 11 datasets.
2. Reduces training time significantly.
3. Effectively handles long-tailed and zero-shot tasks.

Abstract

Pre-trained vision-language models like CLIP have shown powerful zero-shot inference ability via image-text matching and prove to be strong few-shot learners in various downstream tasks. However, in real-world scenarios, adapting CLIP to downstream tasks may encounter the following challenges: 1) data may exhibit long-tailed distributions and may not have abundant samples for all classes; 2) there might be emerging tasks with new classes that contain no samples at all. To overcome them, we propose a novel framework to achieve efficient and long-tailed generalization, termed Candle. During the training process, we propose a compensating logit-adjusted loss to encourage large margins of prototypes and alleviate imbalance both within the base classes and between the base and new classes. For efficient adaptation, we treat the CLIP model as a black box and leverage…
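The zero-shot inference via image-text matching that the abstract refers to can be sketched as follows: an image embedding is compared with one text embedding per class by cosine similarity, and a scaled softmax turns the similarities into class probabilities. This is a toy illustration in plain Python, not code from the Candle repository. The 3-dimensional embeddings, the function names, and the `scale` value are assumptions; in practice the embeddings would come from the CLIP image and text encoders.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_probs(image_emb, text_embs, scale=100.0):
    """Class probabilities from cosine similarity between one image and class texts.

    `scale` plays the role of an inverse temperature (assumed value).
    """
    img = l2_normalize(image_emb)
    sims = [sum(a * b for a, b in zip(img, l2_normalize(t))) for t in text_embs]
    logits = [scale * s for s in sims]
    # Numerically stable softmax.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical embeddings: one text embedding per class prompt.
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.2, 0.9, 0.1]  # closest in direction to the second class

probs = zero_shot_probs(image_emb, text_embs)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Because classification only needs a text embedding per class, a new class with no training images can be added by appending its text embedding to `text_embs`.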

Code & Models

Repositories

shijxcs/candle (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Digital Imaging for Blood Diseases

Methods: Softmax · Attention Is All You Need · Balanced Selection · Contrastive Language-Image Pre-training