ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference
Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, Wayne Zhang

TL;DR
This paper analyzes CLIP's architecture, identifies residual connections as a noise source, and proposes ClearCLIP, a modified model that improves open-vocabulary semantic segmentation by decomposing representations.
Contribution
The paper introduces ClearCLIP, a novel approach that modifies CLIP's architecture to improve segmentation quality by removing residual connections and adjusting the attention mechanism.
Findings
ClearCLIP produces clearer, more accurate segmentation maps.
It outperforms existing methods across multiple benchmarks.
Removing residual connections reduces noise in segmentation results.
Abstract
Despite the success of large-scale pretrained Vision-Language Models (VLMs) especially CLIP in various open-vocabulary tasks, their application to semantic segmentation remains challenging, producing noisy segmentation maps with mis-segmented regions. In this paper, we carefully re-investigate the architecture of CLIP, and identify residual connections as the primary source of noise that degrades segmentation quality. With a comparative analysis of statistical properties in the residual connection and the attention output across different pretrained models, we discover that CLIP's image-text contrastive training paradigm emphasizes global features at the expense of local discriminability, leading to noisy segmentation results. In response, we propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation. We introduce three…
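The decomposition described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a single-head attention block with random weights, omits layer normalization, the feed-forward network, and CLIP's projection layers, and uses hypothetical names (`last_block_terms`, `patch_labels`). It only shows how a final-block output splits into a residual term and an attention term, and how per-patch labels can be computed from either the sum or the attention term alone.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def last_block_terms(x, w_q, w_k, w_v, w_o):
    """Return (residual term, attention term) of a simplified last block.

    Assumption: single head, no layer norm, no feed-forward network.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return x, (attn @ v) @ w_o

def patch_labels(features, text_emb):
    # Cosine similarity between each patch feature and each class text
    # embedding; the best-matching class becomes the patch label.
    f = features / np.linalg.norm(features, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    return (f @ t.T).argmax(axis=-1)

# Toy data: random stand-ins for patch tokens, weights, and text embeddings.
rng = np.random.default_rng(0)
n_patches, dim, n_classes = 16, 8, 3
x = rng.normal(size=(n_patches, dim))
w_q, w_k, w_v, w_o = (rng.normal(size=(dim, dim)) for _ in range(4))
text_emb = rng.normal(size=(n_classes, dim))

x_res, x_attn = last_block_terms(x, w_q, w_k, w_v, w_o)

# Standard block output: residual term plus attention term.
labels_standard = patch_labels(x_res + x_attn, text_emb)
# Variant in the spirit of the paper: drop the residual term.
labels_attn_only = patch_labels(x_attn, text_emb)
```

With random weights the two label maps carry no meaning; the sketch only makes explicit which term is dropped. On a real CLIP checkpoint, `x` would be the patch tokens entering the last transformer block and `text_emb` the encoded class prompts.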
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training · Residual Connection
