TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer

Jiajun Deng; Zhengyuan Yang; Daqing Liu; Tianlang Chen; Wengang Zhou; Yanyong Zhang; Houqiang Li; Wanli Ouyang

arXiv:2206.06619·cs.CV·June 15, 2022·5 cites

TL;DR

TransVG++ introduces a fully Transformer-based framework for visual grounding that simplifies multi-modal fusion, improves training efficiency, and achieves state-of-the-art results across multiple datasets.

Contribution

It proposes TransVG++, a novel end-to-end Transformer-based model that replaces complex fusion modules with a unified architecture leveraging Vision Transformer and language-conditioned fusion.

Findings

01. Achieves state-of-the-art performance on five datasets.

02. Simplifies the fusion process with a unified Transformer architecture.

03. Demonstrates improved training efficiency and robustness.

Abstract

In this work, we explore neat yet effective Transformer-based frameworks for visual grounding. The previous methods generally address the core problem of visual grounding, i.e., multi-modal fusion and reasoning, with manually-designed mechanisms. Such heuristic designs are not only complicated but also make models easily overfit specific data distributions. To avoid this, we first propose TransVG, which establishes multi-modal correspondences by Transformers and localizes referred regions by directly regressing box coordinates. We empirically show that complicated fusion modules can be replaced by a simple stack of Transformer encoder layers with higher performance. However, the core fusion Transformer in TransVG is stand-alone against uni-modal encoders, and thus should be trained from scratch on limited visual grounding data, which makes it hard to be optimized and leads to…
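The TransVG idea described in the abstract (fuse visual and language tokens with a plain stack of Transformer encoder layers, then regress box coordinates directly) can be illustrated with a toy, standard-library-only sketch. This is not the authors' implementation. The learnable query token `reg`, the single-head attention without feed-forward sublayers or position encodings, the random weights, and the normalized `(cx, cy, w, h)` box format are all assumptions made for illustration, and every function name here is hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    """Random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def self_attention(tokens, Wq, Wk, Wv):
    """Single-head self-attention with a residual connection (simplified layer)."""
    d = len(tokens[0])
    Q = [matvec(Wq, t) for t in tokens]
    K = [matvec(Wk, t) for t in tokens]
    V = [matvec(Wv, t) for t in tokens]
    out = []
    for q, t in zip(Q, tokens):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        mixed = [sum(w * v[i] for w, v in zip(weights, V)) for i in range(d)]
        out.append([a + b for a, b in zip(t, mixed)])
    return out


def make_params(d, num_layers, seed=0):
    """Build the hypothetical parameter set: query token, layers, box head."""
    rng = random.Random(seed)
    return {
        "reg": [rng.uniform(-0.5, 0.5) for _ in range(d)],
        "layers": [
            tuple(rand_matrix(d, d, rng) for _ in range(3))
            for _ in range(num_layers)
        ],
        "head": rand_matrix(4, d, rng),
    }


def ground(visual_tokens, text_tokens, params):
    """Fuse both modalities in one token sequence and regress a box."""
    # Assumption: a learnable query token is prepended to the joint sequence.
    tokens = [params["reg"]] + visual_tokens + text_tokens
    for Wq, Wk, Wv in params["layers"]:
        tokens = self_attention(tokens, Wq, Wk, Wv)
    # Read the fused state of the query token and map it to 4 coordinates.
    logits = matvec(params["head"], tokens[0])
    # Sigmoid keeps each coordinate in (0, 1), i.e. normalized to image size.
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


rng = random.Random(1)
D = 8  # toy embedding width
visual = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(6)]
text = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(3)]
params = make_params(D, num_layers=2)
box = ground(visual, text, params)
print(box)  # four normalized values, read here as (cx, cy, w, h)
```

The point of the sketch is structural: there is no hand-designed fusion module, only shared attention over the concatenated sequence followed by a small regression head. In a real system the token embeddings would come from the visual and language encoders and the weights would be trained with a box regression loss.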

Code & Models

Repositories

djiajunustc/TransVG
PyTorch · Official

Models

🤗 linhuixiao/Awesome-Visual-Grounding
model · ♡ 1

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Attention Is All You Need · Linear Layer · Label Smoothing · Softmax · Absolute Position Encodings · Dropout · Adam · Residual Connection · Byte Pair Encoding · Position-Wise Feed-Forward Layer