VD-BERT: A Unified Vision and Dialog Transformer with BERT

Yue Wang; Shafiq Joty; Michael R. Lyu; Irwin King; Caiming Xiong; Steven C.H. Hoi

arXiv:2004.13278 · cs.CV · November 3, 2020 · 30 cites

TL;DR

VD-BERT introduces a unified vision and dialog transformer leveraging pretrained BERT, achieving state-of-the-art results in visual dialog without external vision-language pretraining.

Contribution

It presents a unified Transformer framework that combines vision and dialog using BERT, supporting both answer ranking and generation in a single architecture.

Findings

01 Achieves top NDCG scores on the Visual Dialog leaderboard.

02 Does not require external vision-language pretraining.

03 Supports both answer ranking and answer generation seamlessly.
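
For readers unfamiliar with the NDCG metric named in the first finding, the following is a minimal sketch of how a normalized discounted cumulative gain score can be computed for one ranked list of candidate answers. The function names `dcg` and `ndcg`, the cutoff parameter `k`, and the toy relevance values are illustrative assumptions; the exact cutoff and relevance annotation used by the leaderboard are not stated on this page.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevances listed in ranked order."""
    # rank is 0-based, so the discount for the top item is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(scores, relevances, k=None):
    """NDCG@k for one question.

    scores:     model score per candidate answer (higher = ranked earlier)
    relevances: graded relevance per candidate, same order as scores
    k:          cutoff (hypothetical parameter; defaults to all candidates)
    """
    if k is None:
        k = len(scores)
    # Rank candidates by descending model score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [relevances[i] for i in order[:k]]
    # The ideal ranking sorts candidates by their true relevance.
    ideal = sorted(relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    # Toy example with three candidates and made-up relevance values.
    print(ndcg([0.9, 0.1, 0.5], [1.0, 0.0, 0.5]))
    print(ndcg([0.1, 0.9, 0.5], [1.0, 0.0, 0.5]))
```

A ranking that orders candidates exactly by relevance scores 1.0; any other ordering scores lower, with mistakes near the top of the list penalized most.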

Abstract

Visual dialog is a challenging vision-language task, where a dialog agent needs to answer a series of questions through reasoning on the image content and dialog history. Prior work has mostly focused on various attention mechanisms to model such intricate interactions. By contrast, in this work, we propose VD-BERT, a simple yet effective framework of unified vision-dialog Transformer that leverages the pretrained BERT language models for Visual Dialog tasks. The model is unified in that (1) it captures all the interactions between the image and the multi-turn dialog using a single-stream Transformer encoder, and (2) it supports both answer ranking and answer generation seamlessly through the same architecture. More crucially, we adapt BERT for the effective fusion of vision and dialog contents via visually grounded training. Without the need of pretraining on external vision-language…
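
To make the "single-stream" idea in the abstract concrete, here is a minimal sketch of how image regions and a multi-turn dialog might be flattened into one input sequence for a single Transformer encoder. The function `build_single_stream`, the special tokens (`[CLS]`, `[SEP]`, `[EOT]`), the segment labels, and the whitespace tokenization are all illustrative assumptions, not details taken from this page; consult the paper or the official repository for the actual input format.

```python
def build_single_stream(region_ids, caption, turns, question, candidate):
    """Flatten image regions and dialog into one token sequence.

    region_ids: placeholder tokens standing in for image region features
    caption:    image caption string
    turns:      list of (question, answer) pairs from the dialog history
    question:   current question string
    candidate:  candidate answer appended to the current question
    Returns (tokens, segments), where segments marks each position as
    "img" or "txt" (hypothetical labels for a segment embedding).
    """
    # Visual part: one position per image region.
    tokens = ["[CLS]"] + list(region_ids) + ["[SEP]"]
    segments = ["img"] * len(tokens)

    # Textual part: caption, then each past turn, then the current turn.
    text = caption.split() + ["[EOT]"]
    for past_q, past_a in turns:
        text += past_q.split() + past_a.split() + ["[EOT]"]
    text += question.split() + candidate.split() + ["[SEP]"]

    tokens += text
    segments += ["txt"] * len(text)
    return tokens, segments


if __name__ == "__main__":
    toks, segs = build_single_stream(
        ["<r0>", "<r1>"],
        "a dog",
        [("what color", "brown")],
        "is it big",
        "yes",
    )
    for tok, seg in zip(toks, segs):
        print(seg, tok)
```

Because both modalities share one sequence, every self-attention layer can relate any image region to any dialog token, which is the contrast with two-stream designs that encode each modality separately before fusing them.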

Code & Models

Repositories

salesforce/VD-BERT
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Byte Pair Encoding · Dense Connections