SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation
Sayan Nag, Koustava Goswami, Srikrishna Karanam

TL;DR
SafaRi is a weakly-supervised sequence transformer that effectively performs referring expression segmentation with limited annotations, achieving competitive results and strong zero-shot generalization.
Contribution
It introduces a novel bootstrapping architecture with attention consistency and mask validity filtering for low-annotation RES training.
Findings
Achieves nearly the same performance as fully-supervised methods with only 30% of the annotations.
Outperforms state-of-the-art fully-supervised methods in low-annotation settings.
Demonstrates strong zero-shot generalization capabilities.
Abstract
Referring Expression Segmentation (RES) aims to provide a segmentation mask of the target object in an image referred to by the text (i.e., referring expression). Existing methods require large-scale mask annotations. Moreover, such approaches do not generalize well to unseen/zero-shot scenarios. To address these issues, we propose a weakly-supervised bootstrapping architecture for RES with several new algorithmic innovations. To the best of our knowledge, ours is the first approach that considers only a fraction of both mask and box annotations (shown in Figure 1 and Table 1) for training. To enable principled training of models in such low-annotation settings, improve image-text region-level alignment, and further enhance spatial localization of the target object in the image, we propose a Cross-modal Fusion with Attention Consistency module. For automatic pseudo-labeling…
Taxonomy
Topics: Speech Recognition and Synthesis
Methods: Softmax · Attention Is All You Need · Attentive Walk-Aggregating Graph Neural Network
