EAVL: Explicitly Align Vision and Language for Referring Image Segmentation

Yichen Yan; Xingjian He; Wenxuan Wang; Sihan Chen; Jing Liu

arXiv:2308.09779·cs.CV·October 15, 2024

Open Access

TL;DR

EAVL introduces a dynamic, explicit alignment of vision and language features in referring image segmentation, significantly improving fine-grained text-to-pixel correlation over previous fixed-kernel methods.

Contribution

The paper proposes a novel Vision-Language Aligner that uses dynamic convolution kernels based on input features, enhancing multi-modal feature alignment in RIS.

Findings

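01. Surpasses previous state-of-the-art methods on the RefCOCO, RefCOCO+, and G-Ref datasets.

02. Replaces fixed convolution kernels with dynamic kernels generated from the input image and sentence, improving vision-language feature alignment.

03. Plug-and-play design for easy integration with existing RIS models.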

Abstract

Referring image segmentation (RIS) aims to segment the object described by a natural-language expression from an image. The main challenge is fine-grained text-to-pixel correlation. In previous methods, the final results are obtained by convolutions with a fixed kernel, following a pattern similar to that of traditional image segmentation. These methods lack explicit alignment of language and vision features in the segmentation stage, resulting in suboptimal correlation. In this paper, we introduce EAVL, a method that explicitly aligns vision and language features. In contrast to fixed convolution kernels, we introduce a Vision-Language Aligner that aligns features in the segmentation stage using dynamic convolution kernels based on the input image and sentence. Specifically, we generate multiple queries representing different emphases of the language expression. These queries are transformed into a series…
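To make the contrast between fixed and dynamic kernels concrete, here is a minimal NumPy sketch of the general idea of language-conditioned dynamic convolution. It is not the authors' implementation: the function name `dynamic_kernel_segmentation`, the linear projection `w_proj`, the use of 1x1 kernels, and the averaging of per-query responses are all assumptions made for illustration, since the abstract above is truncated before it describes these steps.

```python
import numpy as np


def dynamic_kernel_segmentation(visual, queries, w_proj):
    """Illustrative sketch only (hypothetical names and design choices).

    visual:  (C, H, W) visual feature map
    queries: (N, D) language queries, one per emphasis of the expression
    w_proj:  (D, C) assumed linear projection from query space to kernels
    """
    # Each query is turned into its own 1x1 convolution kernel, so the
    # kernels depend on the input sentence instead of being fixed weights.
    kernels = queries @ w_proj  # (N, C)

    # A 1x1 convolution is a channel-wise dot product at every pixel.
    logits = np.einsum("nc,chw->nhw", kernels, visual)  # (N, H, W)

    # Assumption: fuse the per-query responses by simple averaging.
    fused = logits.mean(axis=0)  # (H, W)

    # Sigmoid gives a per-pixel foreground probability.
    return 1.0 / (1.0 + np.exp(-fused))


rng = np.random.default_rng(0)
visual = rng.standard_normal((8, 4, 4))   # C=8 channels, 4x4 feature map
queries = rng.standard_normal((3, 16))    # N=3 queries of dimension D=16
w_proj = rng.standard_normal((16, 8))     # hypothetical projection weights

mask_prob = dynamic_kernel_segmentation(visual, queries, w_proj)
print(mask_prob.shape)  # (4, 4)
```

With a fixed kernel, the `kernels` array would be a learned constant shared by every input. In this sketch it is recomputed from `queries` on each call, so a different sentence yields a different segmentation response over the same visual features.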

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques

Methods: Convolution · ALIGN · Focus