Text-Guided Mixup Towards Long-Tailed Image Categorization
Richard Franklin, Jiawei Yao, Deyang Zhong, Qi Qian, Juhua Hu

TL;DR
This paper introduces a novel text-guided mixup method leveraging vision-language models like CLIP to improve long-tailed image classification by utilizing semantic relations from textual information, showing promising empirical results.
Contribution
It proposes a new text-guided mixup technique that uses pre-trained vision-language models to address long-tailed class distributions in image categorization.
Findings
Effective on long-tailed benchmarks
Leverages semantic relations from text
Theoretical guarantees provided
Abstract
In many real-world applications, the class labels of training data follow a long-tailed frequency distribution, which challenges traditional approaches to training deep neural networks that require large amounts of balanced data. Gathering and labeling data to balance the class label distribution can be both costly and time-consuming. Many existing solutions that apply ensemble learning, re-balancing strategies, or fine-tuning to deep neural networks are limited by the inherent problem of having few samples for a subset of classes. Recently, vision-language models like CLIP have been shown to be effective for zero-shot and few-shot learning by capturing the similarity between vision and language features for image and text pairs. Considering that large pre-trained vision-language models may contain valuable side textual information for minor…
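The abstract is cut off before it describes the method, so the following is only a rough sketch of how text similarity between class names could steer a mixup step. It is not the authors' algorithm. Everything in it is an assumption made for illustration: the function names `class_partner_probs` and `text_guided_mixup`, the softmax temperature, and the random vectors standing in for text embeddings that a model such as CLIP would produce for each class name.

```python
import numpy as np

def class_partner_probs(text_emb, temperature=0.1):
    """Hypothetical helper: per-class distribution over partner classes,
    a softmax over cosine similarity of class text embeddings (self excluded)."""
    normed = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (normed @ normed.T) / temperature
    np.fill_diagonal(logits, -np.inf)            # a class never partners with itself
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def text_guided_mixup(x, y, text_emb, alpha=1.0, rng=None):
    """Illustrative mixup: each sample is mixed with a sample from a class
    drawn in proportion to text similarity (assumed scheme, not the paper's)."""
    rng = np.random.default_rng() if rng is None else rng
    num_classes = text_emb.shape[0]
    probs = class_partner_probs(text_emb)
    one_hot = np.eye(num_classes)[y]

    partners = np.empty(len(y), dtype=int)
    for i, label in enumerate(y):
        target = rng.choice(num_classes, p=probs[label])
        candidates = np.flatnonzero(y == target)
        # Fall back to a uniformly random sample if the batch lacks that class.
        partners[i] = rng.choice(candidates) if len(candidates) else rng.integers(len(y))

    lam = rng.beta(alpha, alpha, size=len(y))     # standard mixup coefficient
    lam_x = lam.reshape(-1, *([1] * (x.ndim - 1)))
    x_mix = lam_x * x + (1 - lam_x) * x[partners]
    y_mix = lam[:, None] * one_hot + (1 - lam[:, None]) * one_hot[partners]
    return x_mix, y_mix

# Toy usage: random placeholders stand in for real images and CLIP text features.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(5, 16))               # 5 classes, 16-dim embeddings
x = rng.normal(size=(8, 3, 4, 4))                 # batch of 8 tiny "images"
y = rng.integers(0, 5, size=8)
x_mix, y_mix = text_guided_mixup(x, y, text_emb, rng=rng)
```

In this sketch the mixed labels remain valid probability vectors, and semantically close classes are paired more often than distant ones, which is one plausible way textual relations could benefit tail classes.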
Taxonomy
Topics: Image Retrieval and Classification Techniques
Methods: Mixup · Contrastive Language-Image Pre-training
