EAVL: Explicitly Align Vision and Language for Referring Image Segmentation
Yichen Yan, Xingjian He, Wenxuan Wang, Sihan Chen, Jing Liu

TL;DR
EAVL introduces a dynamic, explicit alignment of vision and language features in referring image segmentation, significantly improving fine-grained text-to-pixel correlation over previous fixed-kernel methods.
Contribution
The paper proposes a novel Vision-Language Aligner that uses dynamic convolution kernels based on input features, enhancing multi-modal feature alignment in RIS.
Findings
Surpasses prior state-of-the-art methods on the RefCOCO, RefCOCO+, and G-Ref datasets.
Uses dynamic convolution kernels for better feature alignment.
Plug-and-play design for easy integration with existing RIS models.
Abstract
Referring image segmentation (RIS) aims to segment an object mentioned in natural language from an image. The main challenge is fine-grained text-to-pixel correlation. In previous methods, the final results are obtained by convolutions with a fixed kernel, following a pattern similar to traditional image segmentation. These methods lack explicit alignment of language and vision features in the segmentation stage, resulting in suboptimal correlation. In this paper, we introduce EAVL, a method that explicitly aligns vision and language features. In contrast to fixed convolution kernels, we introduce a Vision-Language Aligner that aligns features in the segmentation stage using dynamic convolution kernels based on the input image and sentence. Specifically, we generate multiple queries representing different emphases of the language expression. These queries are transformed into a series…
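The general idea of input-dependent (dynamic) kernels described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract is truncated here, so the 1x1 kernel shape, the linear projection `proj`, the averaging in `fuse`, and all function names are assumptions chosen for illustration only. The point is simply that the kernel weights are computed from language-conditioned queries instead of being fixed parameters.

```python
import random


def linear(vec, weight):
    """Multiply a vector (length d_in) by a weight matrix (d_out rows x d_in cols)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def dynamic_kernels(queries, proj):
    """Turn each query vector into a 1x1 convolution kernel over C channels.

    The projection is a hypothetical stand-in for whatever transformation
    the paper applies to its queries.
    """
    return [linear(q, proj) for q in queries]


def apply_dynamic_conv(features, kernels):
    """Apply each 1x1 kernel to an H x W x C feature map.

    Returns one H x W response map per kernel.
    """
    return [
        [[sum(k * f for k, f in zip(kernel, pixel)) for pixel in row]
         for row in features]
        for kernel in kernels
    ]


def fuse(maps):
    """Average the per-kernel maps into one logit map (an assumed fusion rule)."""
    n = len(maps)
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(m[i][j] for m in maps) / n for j in range(w)] for i in range(h)]


# Toy sizes: H x W feature map with C channels, N queries of dimension D.
rng = random.Random(0)
H, W, C, D, N = 4, 5, 8, 6, 3
features = [[[rng.uniform(-1, 1) for _ in range(C)] for _ in range(W)]
            for _ in range(H)]
queries = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(N)]
proj = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(C)]  # C x D

kernels = dynamic_kernels(queries, proj)      # N kernels, each of length C
maps = apply_dynamic_conv(features, kernels)  # N response maps, each H x W
logits = fuse(maps)                           # single H x W logit map
mask = [[1 if v > 0 else 0 for v in row] for row in logits]
```

With a fixed kernel, `kernels` would be a learned constant shared by every input; here it changes whenever the queries (and hence the sentence) change, which is the contrast the abstract draws.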
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques
Methods: Convolution · ALIGN · Focus
