ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference

Mengcheng Lan; Chaofeng Chen; Yiping Ke; Xinjiang Wang; Litong Feng; Wayne Zhang

arXiv:2407.12442·cs.CV·July 18, 2024

Open Access

TL;DR

This paper analyzes CLIP's architecture, identifies residual connections as a noise source, and proposes ClearCLIP, a modified model that improves open-vocabulary semantic segmentation by decomposing representations.

Contribution

The paper introduces ClearCLIP, an approach that modifies CLIP's architecture to improve segmentation quality by removing residual connections and adjusting the attention mechanism.
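The sketch below illustrates the general idea of dropping the residual path in a final attention block. It is a minimal toy, not the authors' implementation: it uses a single head and plain Python lists, and omits layer normalization, the output projection, and the feed-forward sublayer for brevity. The function names (`last_block`, `attention`), the flags `keep_residual` and `query_query`, and the choice of query-query attention as the "adjusted" attention are all assumptions made for illustration; the summary above does not specify them.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    # Plain-list matrix product: a is (n x d), b is (d x m).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def attention(q, k, v):
    # Scaled dot-product attention for a single head.
    d = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)

def last_block(x, w_q, w_k, w_v, keep_residual, query_query):
    # x: (tokens x dim) input to the final block.
    q = matmul(x, w_q)
    # Hypothetical "adjusted attention": reuse the queries as keys.
    k = q if query_query else matmul(x, w_k)
    v = matmul(x, w_v)
    attn_out = attention(q, k, v)
    if keep_residual:
        # Standard block: residual stream plus attention output.
        return [[a + b for a, b in zip(xr, ar)] for xr, ar in zip(x, attn_out)]
    # Residual-free variant: keep only the attention output.
    return attn_out
```

Calling `last_block(x, w_q, w_k, w_v, keep_residual=False, query_query=True)` returns per-token features that no longer contain the residual stream, which is the component the paper identifies as the main noise source.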

Findings

01. ClearCLIP produces clearer, more accurate segmentation maps.

02. It outperforms existing methods across multiple benchmarks.

03. Removing residual connections reduces noise in segmentation results.

Abstract

Despite the success of large-scale pretrained Vision-Language Models (VLMs), especially CLIP, in various open-vocabulary tasks, their application to semantic segmentation remains challenging, producing noisy segmentation maps with mis-segmented regions. In this paper, we carefully re-investigate the architecture of CLIP and identify residual connections as the primary source of noise that degrades segmentation quality. With a comparative analysis of statistical properties in the residual connection and the attention output across different pretrained models, we discover that CLIP's image-text contrastive training paradigm emphasizes global features at the expense of local discriminability, leading to noisy segmentation results. In response, we propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation. We introduce three…
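For readers unfamiliar with dense inference using CLIP-style models, the following sketch shows the common pattern of labelling each image patch with the class whose text embedding is most similar to it. It is an illustrative toy under stated assumptions, not code from the paper: `segment`, `patch_feats`, `text_embeds`, `grid_h`, and `grid_w` are hypothetical names, the features are tiny hand-written vectors, and real pipelines would obtain them from the image and text encoders and upsample the resulting label grid.

```python
import math

def normalize(v):
    # L2-normalize a vector; fall back to 1.0 to avoid dividing by zero.
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def segment(patch_feats, text_embeds, grid_h, grid_w):
    # patch_feats: one feature vector per patch, in row-major order.
    # text_embeds: one embedding per class name.
    texts = [normalize(t) for t in text_embeds]
    labels = []
    for p in patch_feats:
        p = normalize(p)
        # Cosine similarity between this patch and every class embedding.
        sims = [sum(a * b for a, b in zip(p, t)) for t in texts]
        labels.append(sims.index(max(sims)))
    # Reshape the flat label list into a (grid_h x grid_w) map.
    return [labels[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]
```

Noisy patch features lead directly to wrong `argmax` labels in this scheme, which is why the local discriminability of the final-layer features matters for segmentation quality.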

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling

Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training · Residual Connection