CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free
Monika Wysoczańska, Michaël Ramamonjisoa, Tomasz Trzciński, Oriane Siméoni

TL;DR
CLIP-DIY introduces a zero-shot semantic segmentation method that leverages CLIP's classification abilities and unsupervised localization, achieving state-of-the-art results without additional training.
Contribution
It proposes a novel open-vocabulary segmentation approach that uses CLIP and unsupervised localization, eliminating the need for extra training or annotations.
Findings
State-of-the-art zero-shot results on PASCAL VOC
Competitive performance on COCO
No additional training required
Abstract
The emergence of CLIP has opened the way for open-world image perception. The zero-shot classification capabilities of the model are impressive but are harder to use for dense tasks such as image segmentation. Several methods have proposed different modifications and learning schemes to produce dense output. Instead, we propose in this work an open-vocabulary semantic segmentation method, dubbed CLIP-DIY, which does not require any additional training or annotations, but instead leverages existing unsupervised object localization approaches. In particular, CLIP-DIY is a multi-scale approach that directly exploits CLIP classification abilities on patches of different sizes and aggregates the decision in a single map. We further guide the segmentation using foreground/background scores obtained using unsupervised object localization methods. With our method, we obtain state-of-the-art…
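The multi-scale idea described in the abstract can be illustrated with a small toy sketch. It assumes a hypothetical `classify_patch` callable that stands in for CLIP's zero-shot classifier, a simple non-overlapping tiling at each patch size, summation as the aggregation rule, and a hard threshold on a hypothetical `saliency` map as the foreground/background guidance. None of these choices are taken from the paper, which may tile, aggregate, and apply the guidance differently.

```python
from typing import Callable, List, Optional, Sequence

Image = List[List[float]]  # toy single-channel image, H x W


def multiscale_segment(
    image: Image,
    num_classes: int,
    patch_sizes: Sequence[int],
    classify_patch: Callable[[Image], List[float]],  # hypothetical stand-in for CLIP
    saliency: Optional[Image] = None,  # hypothetical foreground scores in [0, 1]
    fg_threshold: float = 0.5,  # assumption: hard threshold, not the paper's rule
) -> List[List[int]]:
    """Return a per-pixel label map; -1 marks pixels treated as background."""
    h, w = len(image), len(image[0])
    # Per-pixel accumulator of class scores, summed over all scales.
    scores = [[[0.0] * num_classes for _ in range(w)] for _ in range(h)]

    for size in patch_sizes:
        # Tile the image with non-overlapping patches of this size.
        for top in range(0, h, size):
            for left in range(0, w, size):
                bottom, right = min(top + size, h), min(left + size, w)
                patch = [row[left:right] for row in image[top:bottom]]
                probs = classify_patch(patch)
                # Spread the patch-level decision over every pixel it covers.
                for y in range(top, bottom):
                    for x in range(left, right):
                        for c in range(num_classes):
                            scores[y][x][c] += probs[c]

    labels = []
    for y in range(h):
        row = []
        for x in range(w):
            if saliency is not None and saliency[y][x] < fg_threshold:
                row.append(-1)  # low foreground score: background
            else:
                px = scores[y][x]
                row.append(max(range(num_classes), key=px.__getitem__))
        labels.append(row)
    return labels


def toy_classifier(patch: Image) -> List[float]:
    """Toy two-class scorer based on mean brightness (not CLIP)."""
    values = [v for row in patch for v in row]
    mean = sum(values) / len(values)
    return [1.0 - mean, mean]
```

In a real pipeline, `classify_patch` would be replaced by a call that scores an image crop against text prompts for each class name, and `saliency` would come from an unsupervised object localization method.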
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Language-Image Pre-training
