Debiasing CLIP: Interpreting and Correcting Bias in Attention Heads

Wei Jie Yeo; Rui Mao; Moloud Abdar; Erik Cambria; Ranjan Satapathy

arXiv:2505.17425·cs.CV·May 26, 2025

3 Reviews

TL;DR

This paper introduces LTC, a framework for identifying and mitigating bias in CLIP's attention heads, improving fairness and interpretability in multimodal models through targeted ablation and feature integration.

Contribution

LTC is a novel contrastive method that locates spurious and salient attention heads in Vision Transformers and enhances classification by targeted correction.

Findings

01

Achieved a gain of over 50% in worst-group accuracy on benchmarks with background and gender biases, compared to non-training post-hoc baselines.

02

Effectively visualized and interpreted attention heads, confirming the mechanism.

03

Demonstrated the ability to reduce bias and improve model fairness.
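Worst-group accuracy, the metric behind the first finding, is the accuracy of the weakest (class, spurious attribute) group rather than the average over all samples. The snippet below is a minimal sketch of that metric on toy data; the function name `worst_group_accuracy`, the group names, and the data are illustrative assumptions and are not taken from the paper.

```python
from collections import defaultdict


def worst_group_accuracy(labels, groups, preds):
    """Return (worst accuracy, per-group accuracy) over (label, group) pairs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, g, p in zip(labels, groups, preds):
        key = (y, g)                 # a group = class label x spurious attribute
        total[key] += 1
        correct[key] += int(p == y)
    per_group = {k: correct[k] / total[k] for k in total}
    return min(per_group.values()), per_group


# Toy data (hypothetical): bird class vs. background, where the classifier
# follows the background instead of the bird.
labels = ["land", "land", "land", "water", "water", "water"]
groups = ["land_bg", "land_bg", "water_bg", "water_bg", "water_bg", "land_bg"]
preds = ["land", "land", "water", "water", "water", "land"]

wga, per_group = worst_group_accuracy(labels, groups, preds)
average = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(f"average accuracy: {average:.2f}, worst-group accuracy: {wga:.2f}")
```

In this toy case the average accuracy looks acceptable while the groups where class and background disagree are always wrong, which is the failure mode a worst-group metric is meant to expose.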

Abstract

Multimodal models like CLIP have gained significant attention due to their remarkable zero-shot performance across various tasks. However, studies have revealed that CLIP can inadvertently learn spurious associations between target variables and confounding factors. To address this, we introduce Locate-Then-Correct (LTC), a contrastive framework that identifies spurious attention heads in Vision Transformers via mechanistic insights and mitigates them through targeted ablation. Furthermore, LTC identifies salient, task-relevant attention heads, enabling the integration of discriminative features through orthogonal projection to improve classification performance. We evaluate LTC on benchmarks with inherent background and gender biases, achieving a gain of over 50% in worst-group accuracy compared to non-training post-hoc baselines. Additionally, we visualize the representation…
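The two operations the abstract names, ablating selected attention heads and applying an orthogonal projection, can be illustrated generically. The sketch below assumes the image representation can be written as a sum of per-head contributions and uses zero-ablation; both are assumptions made for illustration, and the helper names `ablate_and_sum` and `project_out`, the head indices, and the vectors are hypothetical. How LTC actually locates heads and combines these steps is not shown in this excerpt.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def ablate_and_sum(head_outputs, spurious_heads):
    """Sum per-head contributions, zeroing those flagged as spurious.

    head_outputs maps (layer, head) -> contribution vector (assumed additive).
    """
    dim = len(next(iter(head_outputs.values())))
    total = [0.0] * dim
    for head, vec in head_outputs.items():
        if head in spurious_heads:
            vec = [0.0] * dim        # zero-ablation (an illustrative choice)
        total = [t + x for t, x in zip(total, vec)]
    return total


def project_out(v, u):
    """Remove from v its component along u: v - (v.u / u.u) u."""
    scale = dot(v, u) / dot(u, u)
    return [vi - scale * ui for vi, ui in zip(v, u)]


# Hypothetical per-head contributions in a 2-D embedding space.
head_outputs = {
    (10, 3): [1.0, 0.0],
    (11, 5): [0.0, 2.0],   # pretend this head was flagged as spurious
    (11, 7): [0.5, 0.5],
}
ablated = ablate_and_sum(head_outputs, spurious_heads={(11, 5)})

# Project out a hypothetical spurious direction u from the representation.
u = [0.0, 1.0]
cleaned = project_out(ablated, u)
print(ablated, cleaned)
```

After the projection, `cleaned` has no component along `u`, which is the defining property of projecting onto the orthogonal complement of a direction.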

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 2·Confidence 4

Strengths

1. The method enhances interpretability by quantifying how much each attention head contributes to spurious predictions, revealing which internal components drive biased behavior. Moreover, by visualizing these heads, it allows clear inspection of which image regions lead to incorrect or biased decisions.
2. The proposed method introduces minimal modification to the original CLIP architecture, avoiding retraining and preserving the model’s original zero-shot capabilities.
3. The approach demonst

Weaknesses

1. My main concern is that the paper insufficiently discusses related work and does not provide a comparison in the table. A large body of recent literature has explored bias identification and mitigation in vision–language models, including both training-based and training-free approaches [1–9]. While the paper contrasts its approach with a few baselines, a more in-depth comparison is needed. For instance, B2T [1] mitigates bias by adding bias-related keywords to prompts, but such recent approa

Reviewer 02·Rating 4·Confidence 3

Strengths

1. Novel and Intuitive Location Method: The core contribution, the contrastive method for locating $P_S$ and $P_Y$ heads by comparing $G_{NC}$ and $G_{NW}$, is clever and well-motivated.
2. Targeted and Mechanistic Intervention: Unlike existing methods that apply a global correction to the final image or text representation, LTC performs intervention on specific attention heads. This mechanistic approach is more granular and effective.
3. Good Interpretability: The paper provides strong qualitat

Weaknesses

1. Clarity and Readability: The paper is difficult to read, especially the dense methodology section. It would benefit from clearer diagrams or worked-out toy examples.
2. Generalizability of Located Heads: The "Locate" step is inherently dataset-dependent, as it requires a set of samples (even if a validation set) to identify $P_S$ and $P_Y$. The paper shows one instance of generalization (reusing heads from GenderBias-VL for FairFace) and

Reviewer 03·Rating 4·Confidence 4

Strengths

1. The paper is very well-written, well-presented, and easy to follow.
2. The methodology is technically solid, with clear and correct mathematical notations.
3. Illustrations and tables are clear and effectively support the paper's claims.

Weaknesses

Despite the paper's strengths, the primary weakness lies in the discussion and comparison to related work. While the application of Causal Mediation Analysis is interesting, the core "locate-and-correct" idea is not unique. The authors, perhaps due to the timing of publication, seem to have missed several recent and highly relevant papers that explore similar ideas and show strong performance in debiasing. The authors should discuss these papers and provide a clear rationale for why the propose
