PhraseCut: Language-based Image Segmentation in the Wild
Chenyun Wu, Zhe Lin, Scott Cohen, Trung Bui, Subhransu Maji

TL;DR
This paper introduces a large-scale dataset for language-based image segmentation and proposes a modular approach that effectively handles diverse and complex referring phrases, outperforming existing methods.
Contribution
The paper presents a new dataset of 77,262 images and 345,486 phrase-region pairs, built on top of Visual Genome, and a modular segmentation method that improves over prior approaches.
Findings
The scale and diversity of concepts in the dataset pose significant challenges to existing state-of-the-art segmentation models.
The proposed method outperforms previous approaches on the new dataset.
Handling long-tail concepts improves segmentation accuracy.
Abstract
We consider the problem of segmenting image regions given a natural language phrase, and study it on a novel dataset of 77,262 images and 345,486 phrase-region pairs. Our dataset is collected on top of the Visual Genome dataset and uses the existing annotations to generate a challenging set of referring phrases for which the corresponding regions are manually annotated. Phrases in our dataset correspond to multiple regions and describe a large number of object and stuff categories as well as their attributes such as color, shape, parts, and relationships with other entities in the image. Our experiments show that the scale and diversity of concepts in our dataset pose significant challenges to the existing state-of-the-art. We systematically handle the long-tail nature of these concepts and present a modular approach to combine category, attribute, and relationship cues that…
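The abstract mentions a modular approach that combines category, attribute, and relationship cues, but the excerpt here is truncated and does not say how the cues are combined. As a rough illustration only, the sketch below assumes each cue yields a per-pixel score map in [0, 1] and fuses them with a weighted average followed by a threshold. The names `CueScores` and `combine_cues`, the weights, and the threshold are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A score map is a row-major grid of per-pixel scores in [0, 1].
ScoreMap = List[List[float]]


@dataclass
class CueScores:
    """Hypothetical container: one per-pixel score map per cue type."""
    category: ScoreMap
    attribute: ScoreMap
    relationship: ScoreMap


def combine_cues(
    cues: CueScores,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
    threshold: float = 0.5,
) -> List[List[int]]:
    """Fuse three score maps into one binary mask.

    Assumption (not from the paper): fusion is a weighted average of the
    three maps, and a pixel is kept when the average reaches `threshold`.
    """
    w_cat, w_att, w_rel = weights
    total = w_cat + w_att + w_rel
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    mask = []
    rows = zip(cues.category, cues.attribute, cues.relationship)
    for row_cat, row_att, row_rel in rows:
        mask.append([
            1 if (w_cat * c + w_att * a + w_rel * r) / total >= threshold else 0
            for c, a, r in zip(row_cat, row_att, row_rel)
        ])
    return mask


if __name__ == "__main__":
    # Toy 1x2 image: the first pixel scores high on every cue,
    # the second only on the relationship cue.
    cues = CueScores(
        category=[[0.9, 0.2]],
        attribute=[[0.8, 0.1]],
        relationship=[[0.7, 0.9]],
    )
    print(combine_cues(cues))
```

Because the resulting mask is a single grid, it can cover several disconnected regions at once, which matches the abstract's note that a phrase may correspond to multiple regions.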
Videos
PhraseCut: Language-Based Image Segmentation in the Wild (YouTube)
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
