LCCo: Lending CLIP to Co-Segmentation
Xin Duan, Yan Yang, Liyuan Pan, Xiabi Liu

TL;DR
This paper introduces LCCo, a novel co-segmentation method leveraging CLIP's language-image pre-training to improve semantic object segmentation across image sets without requiring extra labeled data.
Contribution
LCCo integrates CLIP into a segmentation framework with three modules, enabling effective co-segmentation without relying on additional supervision or complex network engineering.
Findings
Outperforms prior state-of-the-art methods on four benchmarks
Uses CLIP-mined common semantics to keep segmentation semantically consistent across the image set
Refines backbone features in a coarse-to-fine manner
Abstract
This paper studies co-segmenting the common semantic object in a set of images. Existing works either rely on carefully engineered networks to mine the implicit semantic information in visual features or require extra data (i.e., classification labels) for training. In this paper, we leverage the contrastive language-image pre-training framework (CLIP) for the task. With a backbone segmentation network that independently processes each image from the set, we introduce semantics from CLIP into the backbone features, refining them in a coarse-to-fine manner with three key modules: i) an image set feature correspondence module, encoding globally consistent semantic information of the image set; ii) a CLIP interaction module, using CLIP-mined common semantics of the image set to refine the backbone feature; iii) a CLIP regularization module, drawing CLIP towards this co-segmentation task,…
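The general idea of scoring each image's features against a set-level signal and a shared semantic signal can be illustrated with a toy sketch. This is not the LCCo architecture: the function names, the equal weighting of the two scores, the threshold, and the use of a plain vector in place of a CLIP embedding are all assumptions made for illustration. Each image is a list of per-pixel feature vectors, the "coarse" cue is cosine similarity to the mean feature of the whole set, and the "fine" cue is cosine similarity to a given semantic vector.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def set_consensus(image_feats):
    """Hypothetical set-level descriptor: mean of all normalised pixel features."""
    pixels = [l2_normalize(p) for img in image_feats for p in img]
    dim = len(pixels[0])
    mean = [sum(p[d] for p in pixels) / len(pixels) for d in range(dim)]
    return l2_normalize(mean)


def toy_cosegment(image_feats, semantic_vec, threshold=0.5):
    """Return one binary mask per image.

    semantic_vec stands in for a shared semantic embedding (an assumption;
    no real CLIP model is called here).
    """
    consensus = set_consensus(image_feats)
    semantic_vec = l2_normalize(semantic_vec)
    masks = []
    for img in image_feats:
        mask = []
        for p in img:
            p = l2_normalize(p)
            coarse = dot(p, consensus)     # agreement with the whole image set
            fine = dot(p, semantic_vec)    # agreement with the shared semantics
            score = 0.5 * (coarse + fine)  # equal weighting is an arbitrary choice
            mask.append(1 if score > threshold else 0)
        masks.append(mask)
    return masks


# Two tiny "images" with 2-D features: object pixels point along [1, 0].
images = [
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    [[1.0, 0.1], [0.0, 1.0], [0.1, 1.0]],
]
masks = toy_cosegment(images, semantic_vec=[1.0, 0.0])
```

In this toy setup the pixels whose features align with the semantic vector are kept in both images, while background-like pixels are rejected. A learned model would replace the fixed averaging and threshold with trained modules.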
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: Contrastive Language-Image Pre-training
