VLT: Vision-Language Transformer and Query Generation for Referring Segmentation

Henghui Ding; Chang Liu; Suchen Wang; Xudong Jiang

arXiv:2210.15871 · cs.CV · November 28, 2022

TL;DR

This paper introduces VLT, a vision-language transformer with query generation and query balance modules that improves referring segmentation by handling diverse language expressions and their interactions with the image.

Contribution

The paper contributes a Query Generation Module that produces input-specific queries and a Query Balance Module that selectively fuses their responses, so the model copes with diverse language expressions and multi-modal interactions better than fixed learned queries do.

Findings

01. Achieves state-of-the-art results on five datasets.

02. Effectively models diverse language expressions.

03. Improves understanding of vision-language interactions.

Abstract

We propose a Vision-Language Transformer (VLT) framework for referring segmentation to facilitate deep interactions among multi-modal information and enhance the holistic understanding of vision-language features. There are different ways to understand the dynamic emphasis of a language expression, especially when interacting with the image. However, the learned queries in existing transformer works are fixed after training and therefore cannot cope with the randomness and huge diversity of language expressions. To address this issue, we propose a Query Generation Module, which dynamically produces multiple sets of input-specific queries to represent the diverse comprehensions of the language expression. To find the best among these diverse comprehensions, and thus generate a better mask, we propose a Query Balance Module to selectively fuse the corresponding responses of the set of queries.…
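The two modules named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the per-query projections `view_proj`, the scoring vector `score_vec`, the mean pooling of image features, and the softmax weighting in the balance step are all assumptions made for illustration. The sketch only shows the general idea of building several input-specific queries from the words of an expression and then fusing the per-query responses with query-dependent weights.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def generate_queries(word_feats, image_feats, view_proj):
    """Build N input-specific queries (hypothetical formulation).

    word_feats:  (T, C) one feature vector per word
    image_feats: (P, C) one feature vector per image position
    view_proj:   (N, C, C) assumed learned projections, one per query
    Returns (N, C): each query is a differently weighted sum of the words.
    """
    pooled = image_feats.mean(axis=0)                     # (C,)
    views = np.einsum("ncd,d->nc", view_proj, pooled)     # (N, C)
    scores = views @ word_feats.T / np.sqrt(word_feats.shape[1])  # (N, T)
    word_weights = softmax(scores, axis=-1)               # each row sums to 1
    return word_weights @ word_feats                      # (N, C)


def balance_responses(queries, responses, score_vec):
    """Fuse per-query responses with query-dependent weights (assumed softmax)."""
    confidence = softmax(queries @ score_vec)             # (N,)
    fused = (confidence[:, None] * responses).sum(axis=0)  # (C,)
    return fused, confidence


# Toy sizes: 6 words, 49 image positions, 8 channels, 4 queries.
rng = np.random.default_rng(0)
T, P, C, N = 6, 49, 8, 4
word_feats = rng.normal(size=(T, C))
image_feats = rng.normal(size=(P, C))
view_proj = rng.normal(size=(N, C, C))
score_vec = rng.normal(size=C)

queries = generate_queries(word_feats, image_feats, view_proj)
responses = rng.normal(size=(N, C))  # stand-in for the decoder's per-query output
fused, confidence = balance_responses(queries, responses, score_vec)
print(queries.shape, fused.shape)
```

Because the queries are computed from the inputs rather than stored as fixed parameters, a different expression or image yields a different set of queries, which is the contrast with fixed learned queries that the abstract draws.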

Code & Models

Repositories

henghuiding/Vision-Language-Transformer · TensorFlow · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Adam · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Absolute Position Encodings · Layer Normalization