TL;DR
This paper introduces a novel multi-modal fusion framework for referring image segmentation that performs simultaneous cross-modal and intra-modal interactions, leading to improved segmentation accuracy.
Contribution
The paper proposes the Synchronous Multi-Modal Fusion Module (SFM) and the Hierarchical Cross-Modal Aggregation Module (HCAM) to improve interaction modeling and segmentation quality in referring image segmentation (RIS).
Findings
Achieves state-of-the-art performance on four benchmark datasets.
Demonstrates the effectiveness of simultaneous interaction modeling.
Provides comprehensive ablation studies confirming design choices.
Abstract
We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to a given natural language description. Addressing RIS efficiently requires considering the interactions happening across visual and linguistic modalities and the interactions within each modality. Existing methods are limited because they either compute different forms of interactions sequentially (leading to error propagation) or ignore intra-modal interactions. We address this limitation by performing all three interactions (visual-visual, linguistic-linguistic, and cross-modal) simultaneously through a Synchronous Multi-Modal Fusion Module (SFM). Moreover, to produce refined segmentation masks, we propose a novel Hierarchical Cross-Modal Aggregation Module (HCAM), where linguistic features facilitate the exchange of contextual information across the visual hierarchy. We present thorough ablation studies and validate our approach's performance on…
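One way to picture "simultaneous" interaction modeling is a single attention pass over the concatenated visual and word tokens, so that within-vision, within-language, and cross-modal weights come out of the same softmax instead of separate sequential stages. The sketch below is an illustration under that assumption only. The abstract does not describe SFM's internals, so the function `joint_attention`, the single-head scaled dot-product form, the absence of learned projections, and the toy feature sizes are all hypothetical and not the authors' implementation.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(visual, words):
    """Single-head self-attention over concatenated tokens (hypothetical sketch).

    visual: list of n_v feature vectors, words: list of n_l feature vectors,
    all of the same length. Returns (fused, attn), where attn[i][j] is the
    weight token i places on token j in the joint sequence.
    """
    tokens = visual + words  # one joint sequence: visual tokens first, then words
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    attn = [
        softmax([sum(a * b for a, b in zip(query, key)) / scale for key in tokens])
        for query in tokens
    ]
    # Each fused token is a weighted mix of ALL tokens, both modalities at once.
    fused = [
        [sum(w * tok[j] for w, tok in zip(row, tokens)) for j in range(dim)]
        for row in attn
    ]
    return fused, attn


# Toy inputs (assumed sizes): 6 visual tokens and 4 word tokens, 8-dim features.
random.seed(0)
n_v, n_l, dim = 6, 4, 8
visual = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_v)]
words = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_l)]

fused, attn = joint_attention(visual, words)

# For the first visual token, split its attention mass by modality.
intra_visual_mass = sum(attn[0][:n_v])  # visual -> visual (intra-modal)
cross_modal_mass = sum(attn[0][n_v:])   # visual -> words (cross-modal)
print(intra_visual_mass, cross_modal_mass)
```

Because every row of `attn` is normalized over both modalities together, intra-modal and cross-modal weights compete within one step, which is the property that a sequential pipeline lacks.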
