MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation
Minhyun Lee, Seungho Lee, Song Park, Dongyoon Han, Byeongho Heo, Hyunjung Shim

TL;DR
MaskRIS introduces a novel data augmentation framework for Referring Image Segmentation that employs image and text masking combined with distortion-aware learning, significantly improving robustness and achieving state-of-the-art results.
Contribution
The paper proposes MaskRIS, a new training framework utilizing masking strategies and distortion-aware learning to enhance RIS performance beyond existing methods.
Findings
Outperforms existing RIS methods on multiple datasets
Improves robustness to occlusions and linguistic complexities
Effective in both supervised and weakly supervised settings
Abstract
Referring Image Segmentation (RIS) is an advanced vision-language task that involves identifying and segmenting objects within an image as described by free-form text descriptions. While previous studies have focused on aligning visual and language features, training techniques such as data augmentation remain underexplored. In this work, we explore effective data augmentation for RIS and propose a novel training framework called Masked Referring Image Segmentation (MaskRIS). We observe that conventional image augmentations fall short for RIS and lead to performance degradation, while simple random masking significantly enhances RIS performance. MaskRIS uses both image and text masking, followed by Distortion-aware Contextual Learning (DCL) to fully exploit the benefits of the masking strategy. This approach can improve the model's robustness to occlusions, incomplete…
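To make the masking idea in the abstract concrete, here is a minimal sketch of random image-patch masking and random text-token masking. It is not the authors' implementation: the function names, the patch size, the mask ratios, the whitespace tokenization, and the `[MASK]` placeholder are all assumptions chosen for illustration, and the DCL step is not shown.

```python
import random


def mask_image_patches(image, patch_size, mask_ratio, rng):
    """Zero out a random subset of non-overlapping patches.

    `image` is a 2D list of pixel values; a copy is returned.
    All parameter choices here are illustrative assumptions.
    """
    height, width = len(image), len(image[0])
    # Top-left corner of every patch in the grid.
    corners = [
        (r, c)
        for r in range(0, height, patch_size)
        for c in range(0, width, patch_size)
    ]
    num_masked = int(len(corners) * mask_ratio)
    masked = [row[:] for row in image]  # leave the input untouched
    for r0, c0 in rng.sample(corners, num_masked):
        for r in range(r0, min(r0 + patch_size, height)):
            for c in range(c0, min(c0 + patch_size, width)):
                masked[r][c] = 0
    return masked


def mask_text_tokens(tokens, mask_ratio, rng, mask_token="[MASK]"):
    """Replace a random subset of tokens with a placeholder (hypothetical helper)."""
    num_masked = int(len(tokens) * mask_ratio)
    positions = set(rng.sample(range(len(tokens)), num_masked))
    return [mask_token if i in positions else tok for i, tok in enumerate(tokens)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
image = [[1] * 8 for _ in range(8)]
tokens = "the man in the red shirt".split()  # assumed whitespace tokenization

masked_image = mask_image_patches(image, patch_size=2, mask_ratio=0.5, rng=rng)
masked_tokens = mask_text_tokens(tokens, mask_ratio=0.34, rng=rng)
```

In a training loop, the masked image and masked expression would be fed to the segmentation model in place of (or alongside) the originals; how the two views are combined is the role the abstract assigns to DCL.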
Taxonomy
Topics: AI in cancer detection · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Linear Layer · Weight Decay · Attention Dropout · Linear Warmup With Linear Decay · WordPiece · BERT · Vision Transformer · Stochastic Depth · Swin Transformer
