The Devil is in the Object Boundary: Towards Annotation-free Instance Segmentation using Foundation Models
Cheng Shi, Sibei Yang

TL;DR
This paper introduces Zip, a novel method that combines CLIP and SAM models to enable annotation-free, open-vocabulary instance segmentation and object detection, significantly improving performance without human annotations.
Contribution
The paper proposes Zip, a new pipeline that leverages CLIP's boundary prior to enhance SAM for annotation-free, open-vocabulary segmentation and detection, achieving state-of-the-art results.
Findings
Zip boosts SAM's mask AP on COCO by 12.5%.
Zip achieves comparable performance to annotation-based methods.
Zip enables training-free and label-efficient segmentation.
Abstract
Foundation models, pre-trained on a large amount of data, have demonstrated impressive zero-shot capabilities in various downstream tasks. However, in object detection and instance segmentation, two fundamental computer vision tasks heavily reliant on extensive human annotations, foundation models such as SAM and DINO struggle to achieve satisfactory performance. In this study, we reveal that the devil is in the object boundary, i.e., these foundation models fail to discern boundaries between individual objects. For the first time, we probe that CLIP, which has never accessed any instance-level annotations, can provide a highly beneficial and strong instance-level boundary prior in the clustering results of its particular intermediate layer. Following this surprising observation, we propose Zip, which Zips up CLip and SAM in a novel…
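The boundary-prior idea in the abstract can be illustrated with a toy sketch: cluster per-patch features, then mark the edges where neighbouring patches fall into different clusters. This is not the authors' pipeline. The synthetic features below merely stand in for CLIP intermediate-layer patch tokens, k-means with farthest-point initialisation is an assumed clustering choice, and every function name (make_toy_features, kmeans, cluster_boundaries) is hypothetical.

```python
import numpy as np


def make_toy_features(h=8, w=8, dim=4, noise=0.05, seed=0):
    """Synthetic (h, w, dim) patch features: two 'objects' side by side.

    Assumption: these stand in for patch tokens taken from an
    intermediate layer of an image encoder; no real model is used.
    """
    rng = np.random.default_rng(seed)
    feats = np.zeros((h, w, dim))
    feats[:, : w // 2, 0] = 1.0   # left half resembles object A
    feats[:, w // 2 :, 1] = 1.0   # right half resembles object B
    feats += noise * rng.standard_normal(feats.shape)
    return feats


def kmeans(x, k, iters=10):
    """Plain k-means on an (N, D) array; returns (N,) cluster ids.

    Farthest-point initialisation keeps the toy example deterministic.
    """
    centers = [x[0]]
    for _ in range(1, k):
        # squared distance of every point to its nearest chosen center
        d = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.stack(centers)

    for _ in range(iters):
        # (N, k) squared distances, then nearest-center assignment
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels


def cluster_boundaries(label_grid):
    """Mark edges between neighbouring patches whose cluster ids differ."""
    horizontal = label_grid[:, 1:] != label_grid[:, :-1]   # shape (H, W-1)
    vertical = label_grid[1:, :] != label_grid[:-1, :]     # shape (H-1, W)
    return horizontal, vertical


feats = make_toy_features()
h, w, dim = feats.shape
label_grid = kmeans(feats.reshape(-1, dim), k=2).reshape(h, w)
horizontal, vertical = cluster_boundaries(label_grid)
print("columns with a left/right boundary:", np.flatnonzero(horizontal.any(axis=0)))
```

In this toy setting the only boundary lies between the two halves of the grid. With real encoder features, the number of clusters and the choice of layer would have to be tuned, which the abstract only hints at with "its particular intermediate layer".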
Taxonomy
Topics: Machine Learning and Data Classification · Image Processing and 3D Reconstruction
Methods: Dense Connections · Residual Connection · Softmax · Attention Is All You Need · Layer Normalization · Linear Layer · Multi-Head Attention · Vision Transformer · self-DIstillation with NO labels · Balanced Selection
