VU-BERT: A Unified framework for Visual Dialog

Tong Ye; Shijing Si; Jianzong Wang; Rui Wang; Ning Cheng; Jing Xiao

arXiv:2202.10787·cs.CL·February 23, 2022

TL;DR

VU-BERT introduces a unified vision-language model for visual dialog that simplifies interaction modeling and achieves competitive results by jointly learning visual concepts and dialog dependencies.

Contribution

The paper proposes VU-BERT, a unified framework that uses patch projection for visual dialog. This simplifies interaction modeling, and the model is trained jointly on masked language modeling and next utterance retrieval.

Findings

01. Achieves 0.7287 NDCG on VisDial v1.0

02. Uses joint training on masked language modeling and next utterance retrieval

03. Simplifies visual dialog modeling with patch projection
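
For readers unfamiliar with the metric in the first finding, here is a minimal sketch of NDCG (normalized discounted cumulative gain) for one ranked list of candidate answers. The function name `ndcg`, the default cutoff rule (the number of candidates with non-zero relevance), and the toy relevance scores are assumptions made for illustration. None of them are taken from the paper or from the official VisDial evaluation code.

```python
import math


def ndcg(ranked_relevance, k=None):
    """NDCG for one ranking, given relevance scores listed in ranked order."""
    if k is None:
        # Assumption: cut off at the number of candidates with non-zero relevance.
        k = sum(1 for r in ranked_relevance if r > 0)

    def dcg(scores):
        # Discount each relevance score by log2 of its 1-based rank plus one.
        return sum(r / math.log2(i + 2) for i, r in enumerate(scores[:k]))

    ideal = dcg(sorted(ranked_relevance, reverse=True))
    return dcg(ranked_relevance) / ideal if ideal > 0 else 0.0


# Toy example: relevance of three candidate answers in the order the model ranked them.
perfect = ndcg([1.0, 0.5, 0.0])   # already in ideal order, so the score is 1.0
shuffled = ndcg([0.0, 1.0, 0.5])  # the best answer is ranked second, so the score drops
```

A dataset-level score such as the reported 0.7287 would typically be the mean of such per-question values, although the exact averaging used in the paper is not stated here.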

Abstract

The visual dialog task trains an agent to answer multi-turn questions about a given image, which requires a deep understanding of the interactions between the image and the dialog history. Existing research tends to model these interactions with modality-specific modules, which can be cumbersome to use. To fill this gap, we propose VU-BERT, a unified framework for joint image-text embedding, and are the first to apply patch projection to obtain vision embeddings in visual dialog tasks, which simplifies the model. The model is trained on two tasks: masked language modeling and next utterance retrieval. These tasks help it learn visual concepts, utterance dependencies, and the relationships between the two modalities. Finally, VU-BERT achieves competitive performance (an NDCG score of 0.7287) on the VisDial v1.0 dataset.
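
The abstract names patch projection without describing it. A minimal sketch follows, assuming the common reading in which an image is cut into non-overlapping square patches, each patch is flattened, and a single linear layer maps it to the embedding width. The function name `patch_projection`, the image size, the patch size, the hidden width, and the random weights are all hypothetical choices for illustration, not values reported by the paper.

```python
import numpy as np


def patch_projection(image, patch_size, weight, bias):
    """Split an (H, W, C) image into non-overlapping patches and linearly project each."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(gh * gw, patch_size * patch_size * c)
    # One shared linear map turns every flattened patch into a visual token.
    return patches @ weight + bias


# Hypothetical sizes, chosen small so the sketch runs quickly.
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
patch_size, hidden = 16, 32
weight = rng.normal(scale=0.02, size=(patch_size * patch_size * 3, hidden))
bias = np.zeros(hidden)

visual_tokens = patch_projection(image, patch_size, weight, bias)
print(visual_tokens.shape)  # (16, 32): a 4 x 4 grid of patches, 32 features each
```

In a unified model of this kind, the resulting visual tokens would be concatenated with the text token embeddings and fed to one shared transformer, which is what removes the need for a separate modality-specific visual module.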

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Topic Modeling