Fully and Weakly Supervised Referring Expression Segmentation with End-to-End Learning
Hui Li, Mingjie Sun, Jimin Xiao, Eng Gee Lim, and Yao Zhao

TL;DR
This paper introduces a parallel position-kernel-segmentation pipeline for referring expression segmentation (RES) that separates localization from segmentation. The separation makes weakly-supervised training from click annotations possible, and the authors report state-of-the-art results.
Contribution
Proposes a novel parallel position-kernel-segmentation pipeline that improves RES by separating localization and segmentation, and enables weakly-supervised learning with click annotations.
Findings
Outperforms previous RES methods on multiple benchmarks.
Enables weakly-supervised RES training with click annotations.
Achieves significant performance gains in both fully- and weakly-supervised settings.
Abstract
Referring Expression Segmentation (RES), which aims to localize and segment the target described by a given language expression, has drawn increasing attention. Existing methods treat localization and segmentation jointly, relying on the same fused visual and linguistic features for both steps. We argue that the conflict between the purpose of identifying an object and that of generating a mask limits RES performance. To solve this problem, we propose a parallel position-kernel-segmentation pipeline that better isolates the localization and segmentation steps and then lets them interact. In our pipeline, linguistic information does not directly contaminate the visual features used for segmentation. Specifically, the localization step localizes the target object in the image based on the referring expression, and then the visual kernel obtained from the localization step guides the…
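The position-kernel-segmentation idea in the abstract can be illustrated with a toy NumPy sketch. It is not the authors' implementation. The abstract is cut off before it says how the kernel is used, so the sketch makes several assumptions: the localization branch is reduced to a given position score map, the kernel is the visual feature vector at the highest-scoring position, and the mask comes from thresholding cosine similarity between that kernel and every location. The function name position_kernel_segment and the threshold parameter are hypothetical.

```python
import numpy as np

def position_kernel_segment(visual, position_score, threshold=0.5):
    """Toy sketch of a position -> kernel -> segmentation flow (all assumptions).

    visual:         (C, H, W) visual features that contain no language information.
    position_score: (H, W) scores, assumed to come from a separate
                    vision-language localization branch.
    threshold:      hypothetical cosine-similarity cutoff for the mask.
    """
    # Position: take the highest-scoring location as the target position.
    y, x = np.unravel_index(np.argmax(position_score), position_score.shape)
    y, x = int(y), int(x)

    # Kernel: the L2-normalized visual feature vector at that position.
    norms = np.linalg.norm(visual, axis=0, keepdims=True) + 1e-8
    visual_unit = visual / norms
    kernel = visual_unit[:, y, x]                    # shape (C,)

    # Segmentation: compare the kernel with every location (a 1x1 correlation)
    # and keep the locations that look like the target.
    similarity = np.einsum("c,chw->hw", kernel, visual_unit)
    mask = similarity > threshold
    return (y, x), mask
```

Language enters this sketch only through position_score, so the features that produce the mask stay purely visual, which mirrors the separation the abstract argues for. In the weakly-supervised setting described in the TL;DR, a click annotation would plausibly supervise the position step, though how exactly it does so is not stated in the text above.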
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Subtitles and Audiovisual Media
