LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

Zhao Yang; Jiaqi Wang; Yansong Tang; Kai Chen; Hengshuang Zhao; Philip H.S. Torr

arXiv:2112.02244 · cs.CV · April 7, 2022 · 22 cites

TL;DR

This paper introduces LAVT, a vision transformer model that fuses language and visual features early in the encoding process, leading to improved referring image segmentation performance.

Contribution

The paper proposes a novel early fusion approach within a vision transformer encoder for better cross-modal alignment in referring image segmentation.

Findings

01 Outperforms previous state-of-the-art methods on the RefCOCO, RefCOCO+, and G-Ref datasets.

02 Achieves significant accuracy improvements with a lightweight mask predictor.

03 Demonstrates the effectiveness of early feature fusion in vision-language tasks.

Abstract

Referring image segmentation is a fundamental vision-language task that aims to segment out an object referred to by a natural language expression from an image. One of the key challenges behind this task is leveraging the referring expression for highlighting relevant positions in the image. A paradigm for tackling this problem is to leverage a powerful vision-language ("cross-modal") decoder to fuse features independently extracted from a vision encoder and a language encoder. Recent methods have made remarkable advancements in this paradigm by exploiting Transformers as cross-modal decoders, concurrent to the Transformer's overwhelming success in many other vision-language tasks. Adopting a different approach in this work, we show that significantly better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in intermediate layers of a…
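The early-fusion idea described in the abstract can be illustrated with a small NumPy sketch. It is not the paper's actual fusion module, which this page does not describe. It assumes a generic single-head cross-attention in which visual tokens query the word features and the result is added back residually inside an encoder stage. The function name `fuse_language_into_visual`, the weight matrices, and all shapes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_language_into_visual(visual, words, w_q, w_k, w_v):
    """Hypothetical early-fusion step (generic cross-attention, an assumption).

    visual: (N, C) visual tokens from an intermediate encoder layer
    words:  (T, D) word features from a language encoder
    w_q:    (C, C) projection for visual queries
    w_k:    (D, C) projection for language keys
    w_v:    (D, C) projection for language values
    Returns language-aware visual tokens of shape (N, C).
    """
    q = visual @ w_q                      # (N, C)
    k = words @ w_k                       # (T, C)
    v = words @ w_v                       # (T, C)
    # Each visual token attends over all words.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, T)
    # Residual add keeps the visual signal and injects linguistic context.
    return visual + attn @ v


rng = np.random.default_rng(0)
visual = rng.standard_normal((16, 8))    # 16 visual tokens, 8 channels
words = rng.standard_normal((5, 12))     # 5 words, 12-dim embeddings
w_q = rng.standard_normal((8, 8)) * 0.1
w_k = rng.standard_normal((12, 8)) * 0.1
w_v = rng.standard_normal((12, 8)) * 0.1

fused = fuse_language_into_visual(visual, words, w_q, w_k, w_v)
print(fused.shape)  # (16, 8)
```

In a decoder-fusion design, a step like this would run once on the final encoder output. In the early-fusion setting it would instead be applied between intermediate encoder stages, so later visual layers operate on tokens that already carry linguistic context.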

Code & Models

Repositories

yz93/lavt-ris (PyTorch, official)

Models

🤗 linhuixiao/Awesome-Visual-Grounding (model)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques

Methods: Attention Is All You Need · Linear Layer · Vision Transformer · Absolute Position Encodings · Softmax · Residual Connection · Adam · Dropout · Position-Wise Feed-Forward Layer · Layer Normalization