VD-BERT: A Unified Vision and Dialog Transformer with BERT
Yue Wang, Shafiq Joty, Michael R. Lyu, Irwin King, Caiming Xiong, Steven C.H. Hoi

TL;DR
VD-BERT introduces a unified vision and dialog transformer leveraging pretrained BERT, achieving state-of-the-art results in visual dialog without external vision-language pretraining.
Contribution
It presents a unified Transformer framework that combines vision and dialog using BERT, supporting both answer ranking and generation in a single architecture.
Findings
Achieves top NDCG scores on the visual dialog leaderboard.
Does not require external vision-language pretraining.
Supports both answer ranking and generation seamlessly.
Abstract
Visual dialog is a challenging vision-language task, where a dialog agent needs to answer a series of questions through reasoning on the image content and dialog history. Prior work has mostly focused on various attention mechanisms to model such intricate interactions. By contrast, in this work, we propose VD-BERT, a simple yet effective framework of unified vision-dialog Transformer that leverages the pretrained BERT language models for Visual Dialog tasks. The model is unified in that (1) it captures all the interactions between the image and the multi-turn dialog using a single-stream Transformer encoder, and (2) it supports both answer ranking and answer generation seamlessly through the same architecture. More crucially, we adapt BERT for the effective fusion of vision and dialog contents via visually grounded training. Without the need of pretraining on external vision-language…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Byte Pair Encoding · Dense Connections
