Exploring Vision-Language Models for Imbalanced Learning
Yidong Wang, Zhuohao Yu, Jindong Wang, Qiang Heng, Hao Chen, Wei Ye, Rui Xie, Xing Xie, Shikun Zhang

TL;DR
This paper enhances vision-language models for imbalanced datasets by adding a lightweight decoder and applying imbalanced learning techniques, significantly improving classification accuracy on challenging datasets.
Contribution
Introduces a lightweight decoder for VLMs and combines it with imbalanced learning methods to improve performance on skewed datasets. The decoder avoids the out-of-memory problem caused by a large number of classes and helps capture features for tail classes.
Findings
Significant accuracy improvements on ImageNet-LT, iNaturalist18, and Places-LT.
VLMs combined with the decoder and imbalanced learning methods outperform baseline VLMs.
Analysis of pre-training data size, backbone, and training cost effects.
Abstract
Vision-language models (VLMs) that use contrastive language-image pre-training have shown promising zero-shot classification performance. However, they perform relatively poorly on imbalanced datasets, where the class distribution of the training data is skewed and minority classes are predicted poorly. For instance, CLIP achieved only 5% accuracy on the iNaturalist18 dataset. We propose adding a lightweight decoder to VLMs to avoid the out-of-memory (OOM) problem caused by a large number of classes and to capture nuanced features for tail classes. We then explore improving VLMs through prompt tuning, fine-tuning, and imbalanced learning algorithms such as Focal Loss, Balanced SoftMax, and Distribution Alignment. Experiments demonstrate that the performance of VLMs can be further boosted when they are used with the decoder and imbalanced methods. Specifically, our…
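As a rough illustration of two of the imbalanced losses named in the abstract, the sketch below writes Focal Loss and Balanced SoftMax for a single example in plain Python. It follows the standard textbook forms of these losses, not the paper's code; the function names, the default `gamma=2.0`, and the toy logits and class counts are all assumptions made for the example.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def cross_entropy(logits, target):
    """Plain cross-entropy for one example (baseline for comparison)."""
    return -log_softmax(logits)[target]

def focal_loss(logits, target, gamma=2.0):
    """Focal Loss: scales cross-entropy by (1 - p_t)^gamma so that
    well-classified examples contribute less. gamma=2.0 is an assumed default."""
    log_pt = log_softmax(logits)[target]
    pt = math.exp(log_pt)
    return -((1.0 - pt) ** gamma) * log_pt

def balanced_softmax_loss(logits, target, class_counts):
    """Balanced SoftMax: adds log(class count) to each logit before the
    softmax during training, compensating for the skewed label prior."""
    adjusted = [z + math.log(n) for z, n in zip(logits, class_counts)]
    return cross_entropy(adjusted, target)

# Hypothetical 3-class example: class 0 is a head class, class 2 a tail class.
logits = [2.0, 0.5, -1.0]
class_counts = [1000, 100, 10]

ce = cross_entropy(logits, 2)
fl = focal_loss(logits, 2)
bs = balanced_softmax_loss(logits, 2, class_counts)
```

With `gamma=0` the focal loss reduces to cross-entropy, and with equal class counts Balanced SoftMax does the same, which makes both easy to sanity-check against a baseline.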
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI
Methods: Softmax · Contrastive Language-Image Pre-training · Focal Loss
