Associating Spatially-Consistent Grouping with Text-supervised Semantic Segmentation
Yabo Zhang, Zihao Wang, Jun Hao Liew, Jingjia Huang, Manyu Zhu, Jiashi Feng, Wangmeng Zuo

TL;DR
This paper introduces an approach to text-supervised semantic segmentation that uses self-supervised spatially-consistent grouping to improve region-level recognition, reaching 59.2% mIoU on Pascal VOC and 32.4% mIoU on Pascal Context.
Contribution
The paper proposes integrating self-supervised spatially-consistent grouping with text supervision, along with a region-level recognition adaptation using contrastive loss and masking strategies.
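The contribution mentions a contrastive loss for region-level recognition. The paper's exact formulation is not given here, so the following is a minimal NumPy sketch of a generic symmetric InfoNCE loss between paired region and text embeddings, assuming region i is the positive for text i and every other pair in the batch is a negative. The function names `region_text_infonce` and `l2_normalize` and the temperature of 0.07 (a common CLIP-style choice) are hypothetical, and the masking strategies are not modeled.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def region_text_infonce(region_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of region_emb is paired with row i of text_emb.

    region_emb, text_emb: (N, D) arrays (hypothetical inputs).
    temperature: assumed value, not taken from the paper.
    """
    r = l2_normalize(region_emb)
    t = l2_normalize(text_emb)
    logits = r @ t.T / temperature  # (N, N) scaled cosine similarities
    idx = np.arange(logits.shape[0])  # positives lie on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    # Average the region->text and text->region directions.
    return float(0.5 * (cross_entropy(logits) + cross_entropy(logits.T)))
```

Well-aligned pairs yield a loss near zero, while unrelated pairs yield a larger loss, which is the signal that pulls each region toward its matching text.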
Findings
Achieves 59.2% mIoU on Pascal VOC
Achieves 32.4% mIoU on Pascal Context
Significantly surpasses previous state-of-the-art methods
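The findings are reported in mIoU. For reference, here is a minimal NumPy sketch of the standard mean intersection-over-union computed from a confusion matrix. The helper name `mean_iou` and the ignore label 255 are assumptions (255 is a common Pascal VOC convention); benchmark protocols usually accumulate the confusion matrix over the whole validation set rather than averaging per image.

```python
import numpy as np


def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Mean IoU over classes that appear in pred or gt.

    pred, gt: integer label maps of identical shape.
    Pixels where gt == ignore_index are excluded (assumed convention).
    """
    valid = gt != ignore_index
    p = pred[valid].astype(np.int64)
    g = gt[valid].astype(np.int64)
    # confusion[i, j] = number of pixels with ground truth i predicted as j
    confusion = np.bincount(num_classes * g + p, minlength=num_classes ** 2)
    confusion = confusion.reshape(num_classes, num_classes)
    intersection = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection
    present = union > 0  # skip classes absent from both maps
    return float((intersection[present] / union[present]).mean())
```

For example, if one class-0 pixel out of four is mislabeled as class 1, class 0 scores 3/4 and class 1 scores 4/5, giving an mIoU of 0.775.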
Abstract
In this work, we investigate performing semantic segmentation solely by training on image-sentence pairs. Due to the lack of dense annotations, existing text-supervised methods can only learn to group an image into semantic regions via pixel-insensitive feedback. As a result, their grouped results are coarse and often contain small spurious regions, limiting the upper-bound performance of segmentation. On the other hand, we observe that grouped results from self-supervised models are more semantically consistent and break the bottleneck of existing methods. Motivated by this, we propose to associate self-supervised spatially-consistent grouping with text-supervised semantic segmentation. Considering the part-like grouped results, we further adapt a text-supervised model from image-level to region-level recognition with two core designs. First, we encourage fine-grained alignment…
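To make the region-level recognition idea concrete, the sketch below assumes a grouping (for example, from a self-supervised model) is already available as an integer region map. It averages pixel features inside each region and labels the whole region with the class whose text embedding has the highest cosine similarity. This is an illustrative reading of the abstract, not the authors' pipeline; `label_regions` and all of its inputs are hypothetical.

```python
import numpy as np


def label_regions(pixel_emb, region_ids, class_text_emb):
    """Assign each grouped region the class whose text embedding it best matches.

    pixel_emb:      (H, W, D) dense image features (hypothetical input)
    region_ids:     (H, W) integer group index per pixel, assumed to come
                    from a separate self-supervised grouping step
    class_text_emb: (C, D) one text embedding per class name
    Returns an (H, W) map of class indices.
    """
    text = class_text_emb / np.linalg.norm(class_text_emb, axis=1, keepdims=True)
    seg = np.empty(region_ids.shape, dtype=np.int64)
    for rid in np.unique(region_ids):
        mask = region_ids == rid
        region_vec = pixel_emb[mask].mean(axis=0)  # masked average pooling
        region_vec = region_vec / (np.linalg.norm(region_vec) + 1e-8)
        seg[mask] = int(np.argmax(text @ region_vec))  # best cosine match
    return seg
```

Because every pixel in a region receives the same label, the grouping's boundaries cap how accurate the final mask can be, which is why coarse or spurious groups limit segmentation performance.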
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: Attentive Walk-Aggregating Graph Neural Network
