Not All Attention Heads Are What You Need: Refining CLIP's Image Representation with Attention Ablation

Feng Lin; Marco Chen; Haokui Zhang; Xiaotian Yu; Guangming Lu; Rong Xiao

arXiv:2507.00537 · cs.CV · November 18, 2025

TL;DR

This paper analyzes the impact of individual attention heads in CLIP's image encoder, identifying and ablating detrimental heads to improve downstream performance with minimal overhead.

Contribution

The paper introduces the Attention Ablation Technique (AAT), which systematically identifies and suppresses harmful attention heads in CLIP's image encoder to improve downstream performance across diverse domains.
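The summary does not say how harmful heads are identified, so the following is only a minimal sketch of one generic approach: a greedy search that keeps adding a head to the ablation set while a validation score improves. It is not the authors' procedure. The names `greedy_head_selection`, `toy_score`, and `HARMFUL` are hypothetical, and the toy score stands in for a real metric such as retrieval recall measured with the chosen heads ablated.

```python
from typing import Callable, FrozenSet, List, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def greedy_head_selection(
    heads: List[Head],
    score: Callable[[FrozenSet[Head]], float],
) -> FrozenSet[Head]:
    """Greedily grow the set of ablated heads while the score improves.

    Assumption: `score(ablated)` returns a validation metric (higher is
    better) for the model with the given heads suppressed.
    """
    ablated: FrozenSet[Head] = frozenset()
    best = score(ablated)
    improved = True
    while improved:
        improved = False
        best_set = ablated
        # Try every remaining head and remember the best single addition.
        for head in heads:
            if head in ablated:
                continue
            candidate = ablated | {head}
            s = score(candidate)
            if s > best:
                best, best_set = s, candidate
                improved = True
        ablated = best_set
    return ablated


# Hypothetical toy metric: ablating a "harmful" head helps, any other hurts.
HARMFUL = {(0, 1), (2, 3)}


def toy_score(ablated: FrozenSet[Head]) -> float:
    return 0.5 + sum(0.05 if h in HARMFUL else -0.02 for h in ablated)


all_heads = [(layer, head) for layer in range(3) for head in range(4)]
selected = greedy_head_selection(all_heads, toy_score)
```

With this toy metric the search stops once both hypothetical harmful heads are in the set, because adding any further head lowers the score. Each round costs one score evaluation per remaining head, so a real run would need a cheap validation metric.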

Findings

01 AAT improves downstream performance, boosting recall by up to 11.1% on cross-modal retrieval benchmarks.

02 Certain attention heads, distributed across layers, are detrimental to the resulting image representations.

03 AAT adds virtually no extra inference cost.

Abstract

This paper investigates the role of attention heads in CLIP's image encoder. Building on interpretability studies, we conduct an exhaustive analysis and find that certain heads, distributed across layers, are detrimental to the resulting representations. To mitigate their impact, we propose a simple yet effective Attention Ablation Technique (AAT) that suppresses selected heads by directly manipulating their attention weights. By incorporating two complementary strategies tailored to different application scenarios, AAT enables the systematic identification and ablation of harmful heads with minimal overhead. Experiments show that AAT consistently improves downstream performance across diverse domains, boosting recall by up to 11.1% on cross-modal retrieval benchmarks. These results highlight that AAT can effectively refine large-scale VLMs with virtually no extra inference cost, while…
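The abstract says AAT suppresses a head by directly manipulating its attention weights, but does not give the exact rule. As an illustration only, the sketch below assumes one simple form of suppression: the head's learned attention weights are overwritten with uniform weights, so the head returns the plain mean of its value vectors. The function `head_output` and the `ablate` flag are hypothetical, and the uniform replacement is an assumption, not the paper's confirmed mechanism.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def head_output(query, keys, values, ablate=False):
    """Attention output of one head for a single query token.

    With ablate=True the learned weights are replaced by uniform weights
    (assumed form of suppression), so the head can no longer route
    information selectively between tokens.
    """
    n = len(keys)
    if ablate:
        weights = [1.0 / n] * n  # assumption: uniform replacement
    else:
        scale = math.sqrt(len(query))
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
        weights = softmax(scores)
    dim = len(values[0])
    # Weighted sum of value vectors, one output dimension at a time.
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


# Toy example: the query matches the first key strongly.
query = [1.0, 0.0]
keys = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

normal = head_output(query, keys, values)                # dominated by values[0]
ablated = head_output(query, keys, values, ablate=True)  # mean of all values
```

Because the change only swaps one set of weights for another inside an existing head, it adds no parameters and essentially no compute, which fits the abstract's claim of virtually no extra inference cost.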

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications