SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation

Sayan Nag; Koustava Goswami; Srikrishna Karanam

arXiv:2407.02389 · cs.CV · July 3, 2024

TL;DR

SafaRi is a weakly-supervised sequence transformer that effectively performs referring expression segmentation with limited annotations, achieving competitive results and strong zero-shot generalization.

Contribution

It introduces a novel bootstrapping architecture with attention consistency and mask validity filtering for low-annotation RES training.

Findings

01

Achieves nearly the same performance as fully-supervised methods with only 30% of the annotations.

02

Outperforms state-of-the-art fully-supervised methods in low-annotation settings.

03

Demonstrates strong zero-shot generalization capabilities.

Abstract

Referring Expression Segmentation (RES) aims to produce a segmentation mask of the target object in an image that is referred to by a text (i.e., a referring expression). Existing methods require large-scale mask annotations and do not generalize well to unseen/zero-shot scenarios. To address these issues, we propose a weakly-supervised bootstrapping architecture for RES with several new algorithmic innovations. To the best of our knowledge, ours is the first approach that considers only a fraction of both mask and box annotations (shown in Figure 1 and Table 1) for training. To enable principled training of models in such low-annotation settings, improve image-text region-level alignment, and further enhance spatial localization of the target object in the image, we propose a Cross-modal Fusion with Attention Consistency module. For automatic pseudo-labeling…

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Softmax · Attention Is All You Need · Attentive Walk-Aggregating Graph Neural Network