ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation

Mengcheng Lan; Chaofeng Chen; Yiping Ke; Xinjiang Wang; Litong Feng; Wayne Zhang

arXiv:2408.04883·cs.CV·August 12, 2024

TL;DR

ProxyCLIP combines the spatial accuracy of Vision Foundation Models with CLIP's semantic understanding to significantly improve open-vocabulary semantic segmentation without additional training.

Contribution

The paper introduces a training-free proxy attention mechanism that combines VFMs with CLIP, improving segmentation performance across multiple benchmarks.

Findings

01

Average mIoU increased from 40.3 to 44.4 (a gain of 4.1 points) across eight benchmarks.

02

ProxyCLIP effectively bridges spatial precision and semantic richness.

03

The method is adaptable across different VFMs without retraining.

Abstract

Open-vocabulary semantic segmentation requires models to effectively integrate visual representations with open-vocabulary semantic labels. While Contrastive Language-Image Pre-training (CLIP) models shine in recognizing visual concepts from text, they often struggle with segment coherence due to their limited localization ability. In contrast, Vision Foundation Models (VFMs) excel at acquiring spatially consistent local visual representations, yet they fall short in semantic understanding. This paper introduces ProxyCLIP, an innovative framework designed to harmonize the strengths of both CLIP and VFMs, facilitating enhanced open-vocabulary semantic segmentation. ProxyCLIP leverages the spatial feature correspondence from VFMs as a form of proxy attention to augment CLIP, thereby inheriting the VFMs' robust local consistency and maintaining CLIP's exceptional zero-shot transfer…
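
The core idea in the abstract, using VFM feature correspondence as attention weights over CLIP features, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `proxy_attention` and `classify_patches`, the `temperature` parameter, the plain temperature-scaled softmax, and the random toy inputs are all assumptions made for illustration. The official code lives in the mc-lan/proxyclip repository.

```python
import numpy as np


def proxy_attention(vfm_feats, clip_values, temperature=0.1):
    """Aggregate CLIP value features using VFM patch similarity as attention.

    vfm_feats:   (N, Dv) patch features from a vision foundation model.
    clip_values: (N, Dc) per-patch value features from CLIP.
    temperature: hypothetical softmax temperature (an assumption of this sketch).
    """
    # Cosine similarity between every pair of VFM patch features.
    f = vfm_feats / np.linalg.norm(vfm_feats, axis=1, keepdims=True)
    sim = f @ f.T  # (N, N)

    # Row-wise softmax turns similarities into attention weights.
    logits = sim / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)

    # Each output patch is a weighted mix of CLIP values from similar patches.
    return weights @ clip_values, weights


def classify_patches(patch_feats, text_embeds):
    """Assign each patch the class whose text embedding is most similar."""
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return (p @ t.T).argmax(axis=1)


# Toy data standing in for real model outputs (random, for shape checking only).
rng = np.random.default_rng(0)
vfm_feats = rng.normal(size=(16, 32))    # 16 patches, 32-dim VFM features
clip_values = rng.normal(size=(16, 8))   # 16 patches, 8-dim CLIP values
text_embeds = rng.normal(size=(3, 8))    # 3 class names, 8-dim text embeddings

refined, weights = proxy_attention(vfm_feats, clip_values)
labels = classify_patches(refined, text_embeds)
print(refined.shape, labels.shape)
```

Because the attention weights come from a separate model and no parameters are learned, a sketch like this needs no training step, which is consistent with the training-free claim above. Swapping in a different VFM only changes where `vfm_feats` comes from.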

Code & Models

Repositories

mc-lan/proxyclip
PyTorch · Official

Taxonomy

Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training