SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference

Feng Wang; Jieru Mei; Alan Yuille

arXiv:2312.01597 · cs.CV · October 29, 2024 · 2 cites

TL;DR

This paper introduces Correlative Self-Attention (CSA), a simple modification to CLIP's vision encoder that enables effective zero-shot semantic segmentation without retraining, significantly improving performance on multiple benchmarks.

Contribution

The paper proposes CSA, a novel self-attention mechanism that allows CLIP to perform dense prediction tasks like semantic segmentation with minimal changes and no additional training.

Findings

01

CSA achieves 38.2% average zero-shot mIoU across eight semantic segmentation benchmarks.

02

Outperforms the existing state-of-the-art methods, which reach 33.9% mIoU.

03

Far exceeds vanilla CLIP, which reaches 14.1% mIoU.

Abstract

Recent advances in contrastive language-image pretraining (CLIP) have demonstrated strong capabilities in zero-shot classification by aligning visual representations with target text embeddings at the image level. However, in dense prediction tasks, CLIP often struggles to localize visual features within an image and fails to give accurate pixel-level predictions, which prevents it from functioning as a generalized visual foundation model. In this work, we aim to enhance CLIP's potential for semantic segmentation with minimal modifications to its pretrained models. By rethinking self-attention, we surprisingly find that CLIP can adapt to dense prediction tasks by simply introducing a novel Correlative Self-Attention (CSA) mechanism. Specifically, we replace the traditional self-attention block of the CLIP vision encoder's last layer with our CSA module and reuse its pretrained projection…
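
The abstract is cut off before it describes how CSA computes its attention weights, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that CSA scores tokens by correlating a projection with itself (query with query, key with key) instead of query with key, and that it reuses the pretrained query, key, and value projection matrices unchanged. The function names, the random weights, and the scaling factor are all hypothetical and chosen for illustration; the official repository listed under Code & Models is the reference for the actual method.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def vanilla_attention_weights(x, w_q, w_k):
    """Standard self-attention weights: queries are scored against keys."""
    q, k = x @ w_q, x @ w_k
    scale = q.shape[-1] ** -0.5
    return softmax(q @ k.T * scale)


def csa_weights(x, w_q, w_k):
    """Assumed CSA weights: each projection is scored against itself.

    ASSUMPTION: the query-query and key-key maps are summed. This is an
    illustrative reading of the method, not taken from the text above.
    """
    q, k = x @ w_q, x @ w_k
    scale = q.shape[-1] ** -0.5
    return softmax(q @ q.T * scale) + softmax(k @ k.T * scale)


def correlative_self_attention(x, w_q, w_k, w_v):
    """Apply the assumed CSA weights to the value projection."""
    return csa_weights(x, w_q, w_k) @ (x @ w_v)


# Toy stand-ins for patch tokens and pretrained projections (hypothetical sizes).
rng = np.random.default_rng(0)
n_tokens, dim = 6, 8
x = rng.normal(size=(n_tokens, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

out = correlative_self_attention(x, w_q, w_k, w_v)
print(out.shape)  # one output vector per input token
```

Because the sketch only changes how the weights are computed and leaves the projection matrices untouched, it needs no new parameters, which is consistent with the abstract's claim that the pretrained model is reused without retraining.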

Code & Models

Repositories

wangf3014/sclip (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI

Methods: Contrastive Language-Image Pre-training