CLIP Is Shortsighted: Paying Attention Beyond the First Sentence
Marc-Antoine Lavoie, Anas Mahmoud, Aldo Zaimi, Arsene Fansi Tchango, Steven L. Waslander

TL;DR
This paper identifies a bias in CLIP models towards focusing on the first sentence of captions and proposes DeBias-CLIP, a training method that distributes attention across entire captions, improving long-text retrieval.
Contribution
The authors introduce DeBias-CLIP, a training technique that mitigates CLIP's shortcut bias toward the opening summary sentence of a caption. During training it removes the summary sentence and samples tokens, which improves the model's understanding of multi-sentence captions.
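As a rough illustration of the idea described above, the following sketch drops the opening summary sentence of a long caption and randomly keeps a subset of what remains. It is not the authors' implementation. The function names `split_sentences` and `debias_caption`, the `keep_prob` parameter, the regex-based sentence splitter, and the choice to sample at the sentence level rather than the token level are all assumptions made for readability.

```python
import random
import re
from typing import List, Optional


def split_sentences(caption: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", caption.strip())
    return [p for p in parts if p]


def debias_caption(
    caption: str,
    keep_prob: float = 0.5,
    rng: Optional[random.Random] = None,
) -> str:
    """Hypothetical augmentation: remove the first sentence, then sub-sample the rest.

    Sampling is done per sentence here for simplicity; the summary above only
    says that tokens are sampled, so this granularity is an assumption.
    """
    rng = rng or random.Random()
    sentences = split_sentences(caption)
    if len(sentences) <= 1:
        # Nothing to remove: a one-sentence caption is returned unchanged.
        return caption.strip()

    details = sentences[1:]  # drop the opening summary sentence
    kept = [s for s in details if rng.random() < keep_prob]
    if not kept:
        # Always keep at least one detail sentence so the caption is non-empty.
        kept = [rng.choice(details)]
    return " ".join(kept)


if __name__ == "__main__":
    example = (
        "A dog in a park. The dog is brown. "
        "A red ball lies nearby. Trees line the path."
    )
    print(debias_caption(example, keep_prob=0.5, rng=random.Random(0)))
```

Applied to each caption in a training batch, an augmentation of this kind means the text encoder never sees the summary sentence in its usual leading position, so it cannot rely on it as a shortcut.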
Findings
DeBias-CLIP achieves state-of-the-art long-text retrieval performance.
It improves short-text retrieval accuracy.
The method is less sensitive to sentence order permutations.
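One way to probe the sensitivity to sentence order mentioned above is to shuffle the sentences of a caption and compare the resulting text embedding with the original one. The sketch below is only an illustration of such a probe, not the paper's evaluation protocol. `order_sensitivity` and `toy_encode` are hypothetical names, and `toy_encode` is a toy stand-in for a real text encoder: its `decay` parameter mimics a model that weights early tokens more heavily.

```python
import math
import random
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def order_sensitivity(
    sentences: Sequence[str],
    encode: Callable[[str], List[float]],
    trials: int = 20,
    seed: int = 0,
) -> float:
    """Return 1 minus the mean cosine similarity between the original
    caption embedding and embeddings of sentence-shuffled captions."""
    rng = random.Random(seed)
    base = encode(" ".join(sentences))
    sims = []
    for _ in range(trials):
        shuffled = list(sentences)
        rng.shuffle(shuffled)
        sims.append(cosine(base, encode(" ".join(shuffled))))
    return 1.0 - sum(sims) / len(sims)


def toy_encode(text: str, dim: int = 32, decay: float = 1.0) -> List[float]:
    """Hypothetical encoder: hashed bag of words, with position weight decay**pos.

    decay=1.0 ignores word order; decay<1.0 emphasizes early tokens.
    """
    vec = [0.0] * dim
    for pos, word in enumerate(text.lower().split()):
        bucket = sum(ord(c) for c in word) % dim  # deterministic toy hash
        vec[bucket] += decay ** pos
    return vec


if __name__ == "__main__":
    caption = [
        "A dog in a park.",
        "The dog is brown.",
        "A red ball lies nearby.",
        "Trees line the path.",
    ]
    print(order_sensitivity(caption, lambda t: toy_encode(t, decay=1.0)))
    print(order_sensitivity(caption, lambda t: toy_encode(t, decay=0.8)))
```

With a real model, `encode` would be replaced by the text tower of the CLIP variant under test; a lower score indicates an embedding that depends less on where each sentence appears.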
Abstract
CLIP models learn transferable multi-modal features via image-text contrastive learning on internet-scale data. They are widely used in zero-shot classification, multi-modal retrieval, text-to-image diffusion, and as image encoders in large vision-language models. However, CLIP's pretraining is dominated by images paired with short captions, biasing the model toward encoding simple descriptions of salient objects and leading to coarse alignment on complex scenes and dense descriptions. While recent work mitigates this by fine-tuning on small-scale long-caption datasets, we identify an important common bias: both human- and LLM-generated long captions typically begin with a one-sentence summary followed by a detailed description. We show that this acts as a shortcut during training, concentrating attention on the opening sentence and early tokens and weakening alignment over the rest of…
