Shatter and Gather: Learning Referring Image Segmentation with Text Supervision
Dongwon Kim, Namyup Kim, Cuiling Lan, Suha Kwak

TL;DR
This paper introduces a weakly supervised approach to referring image segmentation that trains on text descriptions alone, avoiding costly manual pixel-level labels. On four public benchmarks it outperforms the existing method for the same task as well as recent open-vocabulary segmentation models.
Contribution
The authors propose a model that discovers semantic entities in the input image and combines those relevant to the text query into a mask of the referent, together with a loss function that trains the model from text descriptions alone, with no manual pixel-level labels.
Findings
Outperforms the existing method for the same task on four public referring image segmentation benchmarks
Effective with only text descriptions as supervision
Outperforms recent open-vocabulary segmentation models
Abstract
Referring image segmentation, the task of segmenting arbitrary entities described in free-form text, opens up a variety of vision applications. However, manual labeling of training data for this task is prohibitively costly, leading to a lack of labeled data for training. We address this issue with a weakly supervised learning approach that uses text descriptions of training images as the only source of supervision. To this end, we first present a new model that discovers semantic entities in the input image and then combines the entities relevant to the text query to predict the mask of the referent. We also present a new loss function that allows the model to be trained without any further supervision. Our method was evaluated on four public benchmarks for referring image segmentation, where it clearly outperformed the existing method for the same task and recent open-vocabulary segmentation…
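The two-stage idea in the abstract (discover entities, then combine the ones relevant to the text) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the architecture or the loss, so the function names `shatter` and `gather`, the dot-product assignment, the softmax over entities, and the sigmoid relevance score are all assumptions chosen only to make the data flow concrete. Features are plain Python lists standing in for learned embeddings, and no training is shown.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def shatter(pixel_feats, entity_queries):
    """Softly assign each pixel to one of K entities (assumed mechanism).

    Returns masks[k][i], the weight of pixel i in entity k.
    For every pixel the weights sum to 1 across entities.
    """
    per_pixel = [softmax([dot(p, q) for q in entity_queries])
                 for p in pixel_feats]
    n, k = len(pixel_feats), len(entity_queries)
    return [[per_pixel[i][j] for i in range(n)] for j in range(k)]


def gather(masks, pixel_feats, text_emb):
    """Merge entity masks, weighted by each entity's relevance to the text.

    Relevance is a sigmoid of the dot product between the text embedding
    and the mask-weighted mean feature of the entity (an assumption).
    """
    dim, n = len(text_emb), len(pixel_feats)
    relevance = []
    for mask in masks:
        weight = sum(mask)  # strictly positive because masks come from softmax
        entity_feat = [sum(m * p[d] for m, p in zip(mask, pixel_feats)) / weight
                       for d in range(dim)]
        relevance.append(1.0 / (1.0 + math.exp(-dot(entity_feat, text_emb))))
    referent = [sum(r * mask[i] for r, mask in zip(relevance, masks))
                for i in range(n)]
    return referent, relevance


# Toy input: three "pixels" with 2-D features, two hypothetical entity queries,
# and a text embedding that points toward the first feature direction.
pixel_feats = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
entity_queries = [[1.0, 0.0], [0.0, 1.0]]
text_emb = [1.0, -1.0]

masks = shatter(pixel_feats, entity_queries)
referent_mask, relevance = gather(masks, pixel_feats, text_emb)
```

In this toy setup the first two pixels fall mostly into the entity that matches the text direction, so their values in `referent_mask` are higher than the third pixel's. Because the entity weights of each pixel sum to 1 and every relevance score lies strictly between 0 and 1, every entry of `referent_mask` also stays strictly between 0 and 1.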
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
