Is CLIP Cross-Eyed? Revealing and Mitigating Center Bias in the CLIP Family
Oscar Chew, Hsiao-Ying Huang, Kunal Jain, Tai-I Chen, Khoa D Doan, Kuan-Hao Huang

TL;DR
This paper identifies a persistent center bias in CLIP models, which attend mainly to the center of an image and overlook objects near its boundaries, and proposes training-free strategies that mitigate the bias to improve object recognition.
Contribution
The paper reveals a fundamental center bias in CLIP models and introduces training-free methods, such as visual prompting and attention redistribution, to address it.
Findings
CLIP models tend to focus on central image regions, neglecting boundary objects.
Representation and attention analyses show off-center objects are lost during embedding aggregation.
Training-free strategies can redirect attention and reduce center bias effectively.
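One simple way to make the notion of center bias concrete is to measure how much of an attention map's total mass falls inside a central window of the patch grid. The sketch below is only an illustration under stated assumptions, not the paper's analysis: the function name `center_attention_fraction`, the `center_frac` parameter, and the assumption that the attention has already been reshaped to a 2D patch grid are all hypothetical choices made here.

```python
import numpy as np


def center_attention_fraction(attn, center_frac=0.5):
    """Return the share of attention mass inside a centered window.

    attn: 2D array of non-negative weights over the patch grid
          (assumed here to be, e.g., class-token-to-patch attention
          reshaped to rows x cols; this layout is an assumption).
    center_frac: side length of the central window as a fraction of
          each grid dimension (hypothetical parameter).
    """
    attn = np.asarray(attn, dtype=float)
    if attn.ndim != 2:
        raise ValueError("attn must be a 2D patch grid")
    if not 0 < center_frac <= 1:
        raise ValueError("center_frac must be in (0, 1]")

    rows, cols = attn.shape
    # Window size in patches, at least one patch per dimension.
    win_rows = max(1, int(round(rows * center_frac)))
    win_cols = max(1, int(round(cols * center_frac)))
    # Top-left corner that centers the window in the grid.
    top = (rows - win_rows) // 2
    left = (cols - win_cols) // 2

    total = attn.sum()
    if total <= 0:
        raise ValueError("attn must have positive total mass")
    center = attn[top:top + win_rows, left:left + win_cols].sum()
    return float(center / total)


# A uniform map has no spatial preference: the central window simply
# receives its share of the grid area.
uniform = np.ones((4, 4))
uniform_share = center_attention_fraction(uniform)

# A map whose mass sits entirely on a corner patch has no central mass.
corner = np.zeros((4, 4))
corner[0, 0] = 1.0
corner_share = center_attention_fraction(corner)
```

Under this measure, a share well above the window's area fraction would suggest attention concentrated on the center, while a uniform map serves as the unbiased reference point.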
Abstract
Recent research has shown that contrastive vision-language models such as CLIP often lack fine-grained understanding of visual content. While a growing body of work has sought to address this limitation, we identify a distinct failure mode in the CLIP family, which we term center bias, that persists even in recent model variants. Specifically, CLIP tends to disproportionately focus on the central region of an image, overlooking important objects located near the boundaries. This limitation is fundamental as failure to recognize relevant objects makes it difficult to perform any sophisticated tasks that depend on those objects. To understand the underlying causes of the limitation, we conduct analyses from both representation and attention perspectives. Using interpretability methods, i.e., embedding decomposition and attention map analysis, we find that relevant concepts especially…
