Mask Grounding for Referring Image Segmentation
Yong Xien Chng, Henry Zheng, Yizeng Han, Xuchong Qiu, Gao Huang

TL;DR
This paper introduces MagNet, a novel approach for Referring Image Segmentation that employs Mask Grounding and cross-modal alignment to improve fine-grained visual-language correspondence, leading to state-of-the-art results.
Contribution
The paper proposes Mask Grounding as an auxiliary task and a cross-modal alignment module to enhance visual grounding and address modality gaps in RIS.
Findings
Significant performance improvements on RefCOCO, RefCOCO+, and G-Ref benchmarks.
Effective integration of Mask Grounding with existing RIS methods.
Outperforms previous state-of-the-art approaches.
Abstract
Referring Image Segmentation (RIS) is a challenging task that requires an algorithm to segment objects referred to by free-form language expressions. Despite significant progress in recent years, most state-of-the-art (SOTA) methods still suffer from a considerable language-image modality gap at the pixel and word level. These methods generally 1) rely on sentence-level language features for language-image alignment and 2) lack explicit training supervision for fine-grained visual grounding. Consequently, they exhibit weak object-level correspondence between visual and language features. Without well-grounded features, prior methods struggle to understand complex expressions that require strong reasoning over relationships among multiple objects, especially when dealing with rarely used or ambiguous clauses. To tackle this challenge, we introduce a novel Mask Grounding auxiliary task that…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
