VU-BERT: A Unified framework for Visual Dialog
Tong Ye, Shijing Si, Jianzong Wang, Rui Wang, Ning Cheng, Jing Xiao

TL;DR
VU-BERT introduces a unified vision-language model for visual dialog that simplifies interaction modeling and achieves competitive results by jointly learning visual concepts and dialog dependencies.
Contribution
The paper proposes VU-BERT, a unified framework for visual dialog that uses patch projection to embed images, which simplifies interaction modeling, and that is trained jointly on masked language modeling and next utterance retrieval.
Findings
Achieves 0.7287 NDCG on VisDial v1.0
Uses joint training on masked language modeling and next utterance retrieval
Simplifies visual dialog modeling with patch projection
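For readers unfamiliar with the metric behind the 0.7287 figure, the following is a minimal sketch of NDCG for a single dialog round. It assumes the convention commonly used for VisDial v1.0 evaluation: the gain is the raw relevance score of each candidate answer, and the cutoff k defaults to the number of candidates with non-zero relevance. The function names and the toy numbers are illustrative and do not come from the paper.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: relevance at rank r is divided by log2(r + 1)."""
    # enumerate() is 0-based, so rank r = index + 1 and the discount is log2(index + 2).
    return sum(rel / math.log2(index + 2) for index, rel in enumerate(relevances))


def ndcg(scores, relevances, k=None):
    """NDCG for one question: model scores vs. annotated relevance per candidate.

    Assumption: when k is not given, it is set to the number of candidates
    with non-zero relevance.
    """
    if k is None:
        k = sum(1 for rel in relevances if rel > 0)
    # Rank candidates by model score, highest first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [relevances[i] for i in order[:k]]
    # The ideal ranking places the most relevant candidates first.
    ideal = sorted(relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy example: four candidate answers, two of them judged relevant.
toy_relevances = [1.0, 0.0, 0.5, 0.0]
perfect = ndcg([0.9, 0.1, 0.5, 0.3], toy_relevances)
flawed = ndcg([0.1, 0.9, 0.5, 0.3], toy_relevances)
print(perfect, round(flawed, 4))
```

A ranking that places every relevant candidate at the top scores 1.0, and the score falls as relevant candidates are pushed below irrelevant ones. A dataset-level figure such as the one reported above is an average of per-question values.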
Abstract
The visual dialog task trains an agent to answer multi-turn questions about an image, which requires a deep understanding of the interactions between the image and the dialog history. Existing work tends to model these interactions with modality-specific modules, which can be cumbersome to use. To fill this gap, we propose VU-BERT, a unified framework for joint image-text embedding, and are the first to apply patch projection to obtain vision embeddings in visual dialog, which simplifies the model. The model is trained on two tasks: masked language modeling and next utterance retrieval. These tasks help it learn visual concepts, dependencies between utterances, and the relationships between the two modalities. VU-BERT achieves competitive performance (0.7287 NDCG) on the VisDial v1.0 dataset.
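The patch projection mentioned in the abstract can be illustrated with a minimal sketch. It assumes a ViT-style scheme: the image is cut into non-overlapping square patches, each patch is flattened, and one shared linear layer maps it to the hidden size. The function names, the patch size, the hidden size, and the random weights are illustrative assumptions, not details taken from the paper.

```python
import random


def extract_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch into a vector of length patch_size * patch_size * C.
            flat = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(flat)
    return patches


def project_patches(patches, weight, bias):
    """Apply one shared linear map to every patch.

    weight holds one row of input-length coefficients per output dimension;
    bias holds one offset per output dimension.
    """
    return [
        [sum(x * w for x, w in zip(patch, row)) + b for row, b in zip(weight, bias)]
        for patch in patches
    ]


# Hypothetical sizes: a 4 x 4 RGB image, 2 x 2 patches, hidden size 8.
random.seed(0)
image = [[[random.random() for _ in range(3)] for _ in range(4)] for _ in range(4)]
patch_size, hidden_size = 2, 8
input_size = patch_size * patch_size * 3
weight = [[random.gauss(0.0, 0.02) for _ in range(input_size)] for _ in range(hidden_size)]
bias = [0.0] * hidden_size

embeddings = project_patches(extract_patches(image, patch_size), weight, bias)
print(len(embeddings), len(embeddings[0]))  # number of patches, hidden size
```

In a design like this, the resulting patch vectors can be fed to a Transformer alongside the text token embeddings, so no separate object detector or modality-specific module is needed. A real model would typically add position and modality-type embeddings as well, which this sketch omits.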
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Topic Modeling
