TL;DR
ProxyCLIP combines the spatial accuracy of Vision Foundation Models (VFMs) with CLIP's semantic understanding to improve open-vocabulary semantic segmentation without additional training.
Contribution
It introduces a training-free proxy attention mechanism that harmonizes VFMs and CLIP, enhancing segmentation performance across multiple benchmarks.
Findings
Average mIoU across eight benchmarks increased from 40.3 to 44.4, a gain of 4.1 points.
ProxyCLIP pairs the spatial precision of VFM features with the semantic richness of CLIP.
The method is adaptable across different VFMs without retraining.
Abstract
Open-vocabulary semantic segmentation requires models to effectively integrate visual representations with open-vocabulary semantic labels. While Contrastive Language-Image Pre-training (CLIP) models shine in recognizing visual concepts from text, they often struggle with segment coherence due to their limited localization ability. In contrast, Vision Foundation Models (VFMs) excel at acquiring spatially consistent local visual representations, yet they fall short in semantic understanding. This paper introduces ProxyCLIP, an innovative framework designed to harmonize the strengths of both CLIP and VFMs, facilitating enhanced open-vocabulary semantic segmentation. ProxyCLIP leverages the spatial feature correspondence from VFMs as a form of proxy attention to augment CLIP, thereby inheriting the VFMs' robust local consistency and maintaining CLIP's exceptional zero-shot transfer…
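The proxy-attention idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `proxy_attention_segment`, the `temperature` parameter, the use of cosine similarity followed by a softmax, and the array shapes are all assumptions made for illustration. It only shows the general pattern of letting patch-to-patch similarity from VFM features decide how CLIP patch features are aggregated before they are matched against text embeddings.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def proxy_attention_segment(vfm_feats, clip_values, text_embeds, temperature=0.1):
    """Hypothetical sketch of proxy attention (all names are assumptions).

    vfm_feats:   (N, Dv) patch features from a vision foundation model
    clip_values: (N, Dc) per-patch CLIP features to be aggregated
    text_embeds: (K, Dc) CLIP text embeddings, one per class name
    Returns per-patch class indices (N,) and the attention matrix (N, N).
    """
    # Patch-to-patch cosine similarity computed from the VFM features only.
    f = l2_normalize(vfm_feats)
    sim = f @ f.T

    # Turn similarities into attention weights; each row sums to 1.
    attn = softmax(sim / temperature, axis=-1)

    # Aggregate CLIP features using the VFM-derived attention.
    patch_feats = attn @ clip_values

    # Zero-shot classification: cosine similarity to each text embedding.
    logits = l2_normalize(patch_feats) @ l2_normalize(text_embeds).T
    return logits.argmax(axis=-1), attn
```

In this toy formulation, patches that the VFM considers similar share their CLIP features, so a single patch whose own CLIP feature is ambiguous tends to take the label of its group. No parameters are learned, which matches the training-free setting described above.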
Taxonomy
Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training
