SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference
Feng Wang, Jieru Mei, Alan Yuille

TL;DR
This paper introduces Correlative Self-Attention (CSA), a simple modification to CLIP's vision encoder that enables effective zero-shot semantic segmentation without retraining, significantly improving performance on multiple benchmarks.
Contribution
The paper proposes CSA, a novel self-attention mechanism that allows CLIP to perform dense prediction tasks like semantic segmentation with minimal changes and no additional training.
Findings
CSA achieves 38.2% average zero-shot mIoU on eight benchmarks.
This exceeds the existing state of the art, which reaches 33.9% mIoU.
It is also far above vanilla CLIP's 14.1% mIoU.
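For readers unfamiliar with the metric, mean Intersection-over-Union (mIoU) averages the per-class overlap between predicted and ground-truth segmentation masks. The sketch below is a simplified per-image version in NumPy; the function name `mean_iou` and the toy label maps are illustrative and not from the paper, and benchmark toolkits typically accumulate intersections and unions over the whole dataset before averaging.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Average IoU over classes that appear in the prediction or the target."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps: skip it
        inter = np.logical_and(pred_c, target_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious))

# Toy 2x3 label maps with three classes (hypothetical data).
pred = np.array([[0, 0, 1],
                 [1, 1, 2]])
target = np.array([[0, 1, 1],
                   [1, 1, 2]])

# Per-class IoU: class 0 = 1/2, class 1 = 3/4, class 2 = 1/1.
score = mean_iou(pred, target, num_classes=3)
print(score)  # 0.75
```

On this scale, the reported gap between 38.2% and 33.9% is 4.3 mIoU points.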
Abstract
Recent advances in contrastive language-image pretraining (CLIP) have demonstrated strong capabilities in zero-shot classification by aligning visual representations with target text embeddings at the image level. However, in dense prediction tasks, CLIP often struggles to localize visual features within an image and fails to give accurate pixel-level predictions, which prevents it from functioning as a generalized visual foundation model. In this work, we aim to enhance CLIP's potential for semantic segmentation with minimal modifications to its pretrained models. By rethinking self-attention, we surprisingly find that CLIP can adapt to dense prediction tasks by simply introducing a novel Correlative Self-Attention (CSA) mechanism. Specifically, we replace the traditional self-attention block in the last layer of CLIP's vision encoder with our CSA module and reuse its pretrained projection…
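The abstract is cut off before it describes the mechanism in detail, so the following is only a sketch under stated assumptions, not the authors' implementation. It assumes that CSA scores each token against the others using a single projection correlated with itself (query-to-query and key-to-key) in place of the usual query-to-key product, and that the pretrained query, key, and value projections are reused unchanged. The function names, the single-head simplification, and the random matrices standing in for pretrained weights are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def vanilla_self_attention(x, w_q, w_k, w_v):
    """Standard single-head attention: scores come from q @ k.T."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scale = q.shape[-1] ** -0.5
    return softmax(q @ k.T * scale) @ v

def correlative_self_attention(x, w_q, w_k, w_v):
    """Assumed CSA form: scores come from q @ q.T and k @ k.T.

    Each token is compared with the others through the same projection,
    and the existing weights are reused, so nothing is retrained.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scale = q.shape[-1] ** -0.5
    attn = softmax(q @ q.T * scale) + softmax(k @ k.T * scale)
    return attn @ v, attn

# Random stand-ins for patch tokens and pretrained projections (hypothetical).
rng = np.random.default_rng(0)
n_tokens, dim = 6, 8
x = rng.normal(size=(n_tokens, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

baseline = vanilla_self_attention(x, w_q, w_k, w_v)
out, attn = correlative_self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (6, 8): one output vector per token, same shape as the baseline
```

Because the sketch only changes how the attention map is computed, its output has the same shape as the vanilla block's, which is consistent with the claim that the module can be swapped into the last layer without additional training.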
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI
Methods: Contrastive Language-Image Pre-training
