Debiasing Vision-Language Models with Text-Only Training
Yunfan Yang, Chaoquan Jiang, Zhiyu Lin, Jinlin Xiao, Jiaming Zhang, Jitao Sang

TL;DR
This paper introduces TOD, a text-only debiasing framework for vision-language models that uses a text-as-image paradigm and a multi-target prediction task to improve bias mitigation without requiring image data.
Contribution
The paper proposes a novel text-only debiasing method for VLMs that repurposes text encoders as image encoders and employs a multi-target prediction task to reduce bias.
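The text-as-image idea can be pictured as follows: if captions and images share one embedding space, a caption embedding can stand in for an image embedding and be classified by temperature-scaled cosine similarity against class prompt embeddings. The sketch below is a toy illustration with hand-made vectors, not the paper's implementation; the helper names `cosine` and `classify`, the embeddings, and the value `tau=0.07` are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(sample_emb, class_embs, tau=0.07):
    """Softmax over similarities divided by the temperature tau."""
    logits = [cosine(sample_emb, c) / tau for c in class_embs]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical class prompt embeddings (toy values, not real CLIP outputs).
class_prompts = {
    "landbird": [1.0, 0.0, 0.2],
    "waterbird": [0.0, 1.0, 0.2],
}

# A caption embedding used in place of an image embedding (toy values).
caption_emb = [0.1, 0.9, 0.3]

probs = classify(caption_emb, list(class_prompts.values()))
predicted = list(class_prompts)[probs.index(max(probs))]
print(predicted, probs)
```

In a real setup the vectors would come from the frozen encoders, and the class prompts would be the learnable part during prompt tuning.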
Findings
Achieves state-of-the-art results among image-free debiasing methods.
Significantly improves group robustness on Waterbirds and CelebA datasets.
Demonstrates strong generalization to multiple and unknown bias attributes.
Abstract
Pre-trained vision-language models (VLMs), such as CLIP, have exhibited remarkable performance across various downstream tasks by aligning text and images in a unified embedding space. However, due to the imbalanced distribution of pre-training datasets, CLIP suffers from bias in real-world applications. Existing debiasing methods struggle to obtain sufficient image samples for minority groups and incur high costs for group labeling. To address these limitations, we propose a Text-Only Debiasing framework called TOD, leveraging a text-as-image training paradigm to mitigate visual biases. Specifically, this approach repurposes the text encoder to function as an image encoder, thereby eliminating the need for image data. Simultaneously, it utilizes a large language model (LLM) to generate a balanced text dataset, which is then used for prompt tuning. However, we observed that the…
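A minimal sketch of what an attribute-balanced text dataset could look like. A fixed template stands in for the LLM generation step described in the abstract, so this is not the authors' pipeline; the class names, bias attributes, template, and the function `build_balanced_captions` are all hypothetical. Each caption carries both a target label and a bias label, which is the kind of paired supervision a multi-target prediction task would need.

```python
from collections import Counter
from itertools import product

# Hypothetical target classes and bias attributes (Waterbirds-style toy setup).
CLASSES = ["landbird", "waterbird"]
BIAS_ATTRS = ["land background", "water background"]

# Assumed template; an LLM would produce varied phrasings instead.
TEMPLATE = "a photo of a {cls} on a {bias}"

def build_balanced_captions(per_group=2):
    """Emit the same number of captions for every (class, bias) pair."""
    samples = []
    for (y, cls), (b, bias) in product(enumerate(CLASSES), enumerate(BIAS_ATTRS)):
        for _ in range(per_group):
            samples.append({
                "text": TEMPLATE.format(cls=cls, bias=bias),
                "target": y,  # index into CLASSES
                "bias": b,    # index into BIAS_ATTRS
            })
    return samples

dataset = build_balanced_captions(per_group=2)
group_counts = Counter((s["target"], s["bias"]) for s in dataset)
print(group_counts)
```

Because every (class, bias) group has the same count, no bias attribute is predictive of the target label within the generated text.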
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- **Motivation**: The authors propose a text-only framework to mitigate the expense of image-based fine-tuning. This is based on the fact that limited data is available for minority groups, and labeling them can be quite expensive. Through a text-only approach utilizing LLMs, the proposed method can circumvent this expense and generate a balanced dataset for prompt tuning.
- **Results**: The proposed method TOD achieves significant improvements over prior works in various attribute settings in C
### (a) Intuition behind the proposed approach
- The proposed approach, Text-only Debiasing (TOD), is based on the idea that the optimized learnable text prompts are directly applicable to the image modality due to the image and text modalities sharing a unified representation space. However, [R1] shows that the image and text embeddings of CLIP are located in two separate regions in the representation space.
- [R1] essentially contradicts one of the fundamental motivating ideas for TOD. This c
- The authors use the design of the Attribute Balanced Dataset to achieve debiasing with only text information.
- The experiments are thorough, analyzing not only standard benchmarks but also cases with multiple bias attributes and unknown bias.
- This paper claims to debias VLMs, but only evaluates on CLIP models in the experiments. Can the proposed method be applied to debias other VLMs as well?
- The proposed method involves prompt tuning, which could be costly. Could the authors provide a detailed time comparison with other baselines?
- This paper is generally well organized and presented.
- It is well motivated to take text as image in CLIP embedding space and generate balanced text data to address the bias issue of CLIP.
- Overall, the experiments in the paper are quite thorough, especially the loss curve in Figure 2, which effectively validates the effectiveness of the multi-objective prediction task in alleviating overfitting.
1. The paper needs careful proofreading, some writing errors are as follows:
   - demonstrate -> demonstrates, line 041;
   - false attributes -> bias attributes, line 183;
   - Both training and inference process is -> Both training and inference processes are, line 184;
   - we use using -> we use, line 183;
   - $C_b$ -> $C_B$, line 244;
   - $\frac{<\cdot,\cdot>\tau}{\cdots}$ -> $\frac{<\cdot,\cdot>/\tau}{\cdots}$, Eq.2 and Eq.4;
   - sSo -> So, line 277.
2. Eq.5 is confusing. If I understand correctly, $p(y=i,b
1. Reducing bias in vision-language models through text-only training is an interesting topic.
2. The experimental results on Waterbirds and CelebA achieve performance comparable to SOTA image-supervised methods.
3. Figure 2 visually demonstrates the motivation of Multi-Target Prediction.
1. Not ready for submission: less than 10 pages.
2. The writing is poor, many typos, like 'use using' in Line 186, 'sSO' in Line 278.
3. The text generation and MTP are limited in novelty, although the effect seems to be okay.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Focus · Contrastive Language-Image Pre-training
