Linguistic Query-Guided Mask Generation for Referring Image Segmentation

Zhichao Wei; Xiaohao Chen; Mingqiang Chen; Siyu Zhu

arXiv:2301.06429 · cs.CV · March 23, 2023 · 1 citation

TL;DR

This paper introduces LGFormer, a transformer-based framework that uses linguistic queries to generate image segmentation masks, improving cross-modal alignment and segmentation consistency for image-text pairs.

Contribution

The paper presents a novel end-to-end transformer model that dynamically generates prototypes based on linguistic queries for improved referring image segmentation.
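As a rough illustration of what generating a prototype from a linguistic query can mean, the sketch below lets a single sentence-level query attend over per-pixel features, pools them into one prototype, and scores every pixel against that prototype. It is a minimal sketch under stated assumptions, not the LGFormer architecture: the function names, the single-head attention, the sigmoid scoring, and the toy two-dimensional features are all hypothetical choices made for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def linguistic_prototype(query, pixel_feats):
    """Pool pixel features into one prototype, weighted by similarity to the query.

    Assumption: a single sentence-level query and single-head scaled
    dot-product attention stand in for the paper's decoder.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, p) / scale for p in pixel_feats])
    dim = len(query)
    return [sum(w * p[d] for w, p in zip(weights, pixel_feats)) for d in range(dim)]


def generate_mask(query, pixel_feats, threshold=0.5):
    """Score each pixel against the query-specific prototype and threshold it."""
    proto = linguistic_prototype(query, pixel_feats)
    probs = [1.0 / (1.0 + math.exp(-dot(proto, p))) for p in pixel_feats]
    return [1 if p > threshold else 0 for p in probs]


# Toy image with four "pixels": two from one region, two from another.
pixel_feats = [[2.0, 0.0], [2.0, 0.0], [-2.0, 0.0], [-2.0, 0.0]]

# Two different expressions (hypothetical query vectors) give two different masks.
mask_a = generate_mask([1.0, 0.0], pixel_feats)
mask_b = generate_mask([-1.0, 0.0], pixel_feats)
print(mask_a, mask_b)
```

The point of the toy example is that the prototype is computed per image-text pair rather than stored as a fixed parameter, so changing the expression changes which pixels the same image features are matched to.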

Findings

01 Outperforms existing methods on benchmark datasets.

02 Achieves better cross-modal alignment and segmentation accuracy.

03 Demonstrates robustness across diverse image-text pairs.

Abstract

Referring image segmentation is a typical multi-modal task that aims to segment the image region of interest described by a given language expression. Existing methods adopt either a pixel classification-based or a learnable query-based framework for mask generation; both rely on a fixed number of parametric prototypes and are therefore insufficient to deal with the variety of text-image pairs. In this work, we propose LGFormer, an end-to-end transformer-based framework that performs Linguistic query-Guided mask generation. It treats the linguistic features as queries to generate a specialized prototype for an arbitrary input image-text pair, thus producing more consistent segmentation results. Moreover, we design several cross-modal interaction modules (e.g., the vision-language bidirectional attention module, VLBA) in both the encoder and the decoder to achieve better cross-modal alignment.
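The abstract names a vision-language bidirectional attention module but does not describe its internals here. One generic reading of "bidirectional" is that visual features attend to language features while language features attend to visual features, with each side updated by a residual sum. The sketch below shows only that generic pattern; the function names, the absence of learned projections, and the use of keys as values are simplifying assumptions, not the paper's VLBA design.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys_values):
    """Scaled dot-product attention; keys double as values for brevity."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        w = softmax([dot(q, k) / scale for k in keys_values])
        out.append([sum(wi * k[d] for wi, k in zip(w, keys_values))
                    for d in range(len(q))])
    return out


def bidirectional_attention(vision, language):
    """Update both modalities with context gathered from the other one."""
    v_ctx = attend(vision, language)   # each pixel gathers language context
    l_ctx = attend(language, vision)   # each word gathers visual context
    new_vision = [[a + b for a, b in zip(v, c)] for v, c in zip(vision, v_ctx)]
    new_language = [[a + b for a, b in zip(l, c)] for l, c in zip(language, l_ctx)]
    return new_vision, new_language


# Two pixel features and one word feature (toy values).
vision = [[1.0, 0.0], [0.0, 1.0]]
language = [[0.5, 0.5]]
new_vision, new_language = bidirectional_attention(vision, language)
print(new_vision, new_language)
```

Both outputs keep the shape of their inputs, which is what allows a block of this kind to be stacked in an encoder or a decoder.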

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling

Methods: ALIGN