Towards Motion-aware Referring Image Segmentation
Chaeyun Kim, Seunghoon Yi, Yejin Kim, Yohan Jo, Joonseok Lee

TL;DR
This paper introduces a motion-aware approach to referring image segmentation, combining data augmentation with a new contrastive learning method to improve how models handle motion-related queries about images.
Contribution
It proposes a motion-centric data augmentation scheme and Multimodal Radial Contrastive Learning (MRaCL), along with a new benchmark M-Bench for evaluating motion-related segmentation.
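To make the augmentation idea concrete, here is a minimal sketch of extracting a motion-centric phrase from an existing caption so that the model sees it as an extra training query. It is not the authors' pipeline, whose details are not given on this page: the `MOTION_VERBS` seed list and the helpers `extract_motion_phrase` and `augment` are hypothetical, and a real system would more plausibly rely on a parser or a language model than on keyword matching.

```python
# Hypothetical seed list of motion verbs (an assumption for illustration only).
MOTION_VERBS = {"riding", "running", "jumping", "throwing",
                "walking", "catching", "eating", "holding"}

def extract_motion_phrase(caption):
    """Return the part of the caption starting at the first motion verb, or None."""
    words = caption.lower().split()
    for i, word in enumerate(words):
        if word.strip(".,") in MOTION_VERBS:
            return " ".join(words[i:]).rstrip(".")
    return None

def augment(captions):
    """Keep every original caption and append its motion phrase when one exists."""
    augmented = []
    for caption in captions:
        augmented.append(caption)
        phrase = extract_motion_phrase(caption)
        # Skip phrases that would merely duplicate the full caption.
        if phrase is not None and phrase != caption.lower().rstrip("."):
            augmented.append(phrase)
    return augmented

if __name__ == "__main__":
    print(augment(["A man in a red shirt riding a bike.", "the left zebra"]))
```

Because the extracted phrase refers to the same object as the original caption, it can reuse the existing segmentation mask, which is consistent with the claim that no additional annotations are needed.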
Findings
Significant improvement on motion-centric queries across multiple RIS models
Maintains competitive performance on appearance-based descriptions
Introduces a new benchmark for motion-focused RIS evaluation
Abstract
Referring Image Segmentation (RIS) requires identifying objects from images based on textual descriptions. We observe that existing methods significantly underperform on motion-related queries compared to appearance-based ones. To address this, we first introduce an efficient data augmentation scheme that extracts motion-centric phrases from original captions, exposing models to more motion expressions without additional annotations. Second, since the same object can be described differently depending on the context, we propose Multimodal Radial Contrastive Learning (MRaCL), performed on fused image-text embeddings rather than unimodal representations. For comprehensive evaluation, we introduce a new test split focusing on motion-centric queries, and introduce a new benchmark called M-Bench, where objects are distinguished primarily by actions. Extensive experiments show our method…
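The abstract does not spell out the radial formulation of MRaCL, so the sketch below shows only the general idea of contrastive learning on fused image-text embeddings, using a standard InfoNCE-style loss with cosine similarity as a stand-in. The pairing is an assumption: the anchor is taken to be the fused embedding of an image with its original caption, the positive the fused embedding of the same image with its motion-centric phrase, and the negatives fused embeddings that refer to other objects. The names `cosine` and `info_nce` and the temperature value are hypothetical, and inputs are assumed to be non-zero vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Standard InfoNCE loss: low when the anchor is closer to the positive
    than to any negative. This is NOT the paper's radial loss."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

if __name__ == "__main__":
    fused_original = [1.0, 0.0]   # image + full caption (toy values)
    fused_motion = [0.9, 0.1]     # same image + motion phrase (toy values)
    fused_other = [0.0, 1.0]      # embedding referring to another object
    print(info_nce(fused_original, fused_motion, [fused_other]))
```

Computing the loss on fused embeddings rather than on separate image and text features reflects the abstract's point that the same object can be described differently depending on context.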
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis
