The Neglected Tails in Vision-Language Models

Shubham Parashar; Zhiqiu Lin; Tian Liu; Xiangjue Dong; Yanan Li; Deva Ramanan; James Caverlee; Shu Kong

arXiv:2401.12425 · cs.CV · May 24, 2024 · 1 citation

Open Access

TL;DR

This paper investigates the long-tailed distribution of concepts in vision-language models' pretraining data, revealing biases against rare concepts, and proposes a retrieval-augmented learning method that improves zero-shot recognition performance efficiently.

Contribution

It introduces a novel analysis method using LLMs to measure concept frequency and proposes REAL, a retrieval-augmented learning approach that significantly enhances VLMs' recognition of rare concepts.

Findings

01. VLMs perform poorly on rare concepts, which are underrepresented in the long-tailed pretraining data.

02. Prompts that use concept synonyms found in the pretraining texts are more effective.

03. REAL outperforms previous methods while using less data and less training time.
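The synonym finding can be illustrated with a minimal sketch. It assumes that per-synonym frequencies in the pretraining texts are already available and that the most frequent synonym is the one placed in the prompt; that selection rule, the helper names `most_frequent_synonym` and `build_prompt`, the prompt template, and the counts are all hypothetical and not taken from the paper.

```python
def most_frequent_synonym(synonym_counts):
    """Return the synonym with the highest count (assumed selection rule)."""
    return max(synonym_counts, key=synonym_counts.get)


def build_prompt(name, template="a photo of a {}."):
    """Fill a generic zero-shot prompt template with a concept name."""
    return template.format(name)


# Made-up counts for one concept; real counts would come from the pretraining texts.
synonym_counts = {
    "cash machine": 120,
    "ATM": 4500,
    "automated teller machine": 300,
}

chosen = most_frequent_synonym(synonym_counts)
prompt = build_prompt(chosen)
print(prompt)
```

The resulting prompt string would then be passed to the text encoder of the VLM in place of a prompt built from the dataset's original class name.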

Abstract

Vision-language models (VLMs) excel in zero-shot recognition but their performance varies greatly across different visual concepts. For example, although CLIP achieves impressive accuracy on ImageNet (60-80%), its performance drops below 10% for more than ten concepts like night snake, presumably due to their limited presence in the pretraining data. However, measuring the frequency of concepts in VLMs' large-scale datasets is challenging. We address this by using large language models (LLMs) to count the number of pretraining texts that contain synonyms of these concepts. Our analysis confirms that popular datasets, such as LAION, exhibit a long-tailed concept distribution, yielding biased performance in VLMs. We also find that downstream applications of VLMs, including visual chatbots (e.g., GPT-4V) and text-to-image models (e.g., Stable Diffusion), often fail to recognize or generate…
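The counting step described in the abstract can be sketched roughly as follows. This is a simplified illustration, not the paper's pipeline: the synonym lists are hard-coded here, whereas the paper obtains them with an LLM, and the captions, the function name `count_concept_frequency`, and the whole-word matching rule are assumptions made for the example.

```python
import re
from collections import Counter


def count_concept_frequency(captions, synonyms_by_concept):
    """Count the captions that mention at least one synonym of each concept."""
    # One case-insensitive whole-word pattern per concept (assumed matching rule).
    patterns = {
        concept: re.compile(
            r"\b(?:" + "|".join(re.escape(s) for s in synonyms) + r")\b",
            re.IGNORECASE,
        )
        for concept, synonyms in synonyms_by_concept.items()
    }
    counts = Counter({concept: 0 for concept in synonyms_by_concept})
    for caption in captions:
        for concept, pattern in patterns.items():
            if pattern.search(caption):
                counts[concept] += 1
    return counts


# Toy captions standing in for pretraining texts.
captions = [
    "A tiger walking through tall grass",
    "Bengal tiger portrait",
    "a night snake on a rock",
    "my dog at the beach",
    "Tiger cubs playing",
]
# Illustrative synonym lists; in the paper these come from an LLM.
synonyms_by_concept = {
    "tiger": ["tiger", "tigress"],
    "night snake": ["night snake", "nightsnake"],
}

counts = count_concept_frequency(captions, synonyms_by_concept)
# Sorting by count exposes the head and tail of the concept distribution.
for concept, n in counts.most_common():
    print(concept, n)
```

On a real dataset, sorting the counts this way is what would reveal a long-tailed distribution, with a few concepts matched by many captions and many concepts matched by very few.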

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · COVID-19 diagnosis using AI

Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training