VLT: Vision-Language Transformer and Query Generation for Referring Segmentation
Henghui Ding, Chang Liu, Suchen Wang, Xudong Jiang

TL;DR
This paper introduces VLT, a vision-language transformer with dynamic query generation and balancing modules, improving referring segmentation by better understanding diverse language expressions and interactions with images.
Contribution
The paper presents a novel dynamic query generation and balancing framework for referring segmentation, enabling better handling of language diversity and multi-modal interactions.
Findings
Achieves state-of-the-art results on five datasets.
Effectively models diverse language expressions.
Improves understanding of vision-language interactions.
Abstract
We propose a Vision-Language Transformer (VLT) framework for referring segmentation to facilitate deep interactions among multi-modal information and enhance the holistic understanding to vision-language features. There are different ways to understand the dynamic emphasis of a language expression, especially when interacting with the image. However, the learned queries in existing transformer works are fixed after training, which cannot cope with the randomness and huge diversity of the language expressions. To address this issue, we propose a Query Generation Module, which dynamically produces multiple sets of input-specific queries to represent the diverse comprehensions of language expression. To find the best among these diverse comprehensions, so as to generate a better mask, we propose a Query Balance Module to selectively fuse the corresponding responses of the set of queries.…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
MethodsMulti-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Adam · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Absolute Position Encodings · Layer Normalization
