LAVT: Language-Aware Vision Transformer for Referring Image Segmentation
Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, Philip H.S. Torr

TL;DR
This paper introduces LAVT, a vision transformer model that fuses language and visual features early in the encoding process, leading to improved referring image segmentation performance.
Contribution
The paper proposes a novel early fusion approach within a vision transformer encoder for better cross-modal alignment in referring image segmentation.
Findings
Outperforms previous state-of-the-art on RefCOCO, RefCOCO+, and G-Ref datasets.
Achieves significant accuracy improvements with a lightweight mask predictor.
Demonstrates the effectiveness of early feature fusion in vision-language tasks.
Abstract
Referring image segmentation is a fundamental vision-language task that aims to segment out an object referred to by a natural language expression from an image. One of the key challenges behind this task is leveraging the referring expression for highlighting relevant positions in the image. A paradigm for tackling this problem is to leverage a powerful vision-language ("cross-modal") decoder to fuse features independently extracted from a vision encoder and a language encoder. Recent methods have made remarkable advancements in this paradigm by exploiting Transformers as cross-modal decoders, concurrent to the Transformer's overwhelming success in many other vision-language tasks. Adopting a different approach in this work, we show that significantly better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in intermediate layers of a…
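The early-fusion idea described in the abstract can be illustrated with a small sketch. This is not the paper's actual fusion module, whose details are not given in the text above. It is a generic cross-attention step, written in plain Python, in which each visual token attends over the word features and the attended language vector is added back to the token between encoder stages. All function names (`fuse_language_into_vision`, `encoder_stage`, `early_fusion_encoder`) are hypothetical, the encoder stage is an identity placeholder, and learned projections are omitted for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fuse_language_into_vision(visual, words):
    """Cross-attention from visual tokens (queries) to word features (keys/values).

    Assumption: visual and word features share the same dimension and no
    learned projections are applied; a real model would learn these.
    """
    dim = len(visual[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for v in visual:
        # Attention weights of this visual token over all words.
        weights = softmax([dot(v, w) * scale for w in words])
        # Weighted sum of word features: the language context for this token.
        context = [sum(wt * w[d] for wt, w in zip(weights, words))
                   for d in range(dim)]
        # Residual addition keeps the original visual signal.
        fused.append([vi + ci for vi, ci in zip(v, context)])
    return fused


def encoder_stage(tokens):
    """Hypothetical stand-in for one Transformer encoder stage (identity here)."""
    return [list(t) for t in tokens]


def early_fusion_encoder(visual, words, num_stages=2):
    """Interleave encoder stages with language fusion, rather than fusing
    only once in a decoder after both encoders have finished."""
    for _ in range(num_stages):
        visual = encoder_stage(visual)
        visual = fuse_language_into_vision(visual, words)
    return visual


if __name__ == "__main__":
    visual_tokens = [[1.0, 0.0], [0.0, 1.0]]   # two toy image patches
    word_features = [[0.5, 0.5], [1.0, -1.0]]  # two toy word embeddings
    print(early_fusion_encoder(visual_tokens, word_features))
```

The point of the sketch is the control flow: language information enters the visual features after every stage, so later stages already operate on language-conditioned tokens. A decoder-fusion design would instead run all encoder stages first and call the fusion step once at the end.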
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
Methods: Attention Is All You Need · Linear Layer · Vision Transformer · Absolute Position Encodings · Softmax · Residual Connection · Adam · Dropout · Position-Wise Feed-Forward Layer · Layer Normalization
