Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation
Tao Tu, Qing Ping, Govind Thattai, Gokhan Tur, Prem Natarajan

TL;DR
This paper enhances visual dialog agents by integrating a pretrained visual-linguistic model, ViLBERT, to improve understanding and reasoning about objects in images during the GuessWhat?! guessing game, leading to significant performance gains.
Contribution
It introduces ViLBERT-based models for the Oracle, Guesser, and Questioner, with novel fusion mechanisms and a unified framework that uses pretrained visual-linguistic representations.
Findings
Outperforms state-of-the-art models by 7-12% across the Oracle, Guesser, and Questioner tasks.
Improves visual grounding and question understanding.
Enhances long-term dialog comprehension.
Abstract
GuessWhat?! is a two-player visual dialog guessing game where player A asks a sequence of yes/no questions (Questioner) and makes a final guess (Guesser) about a target object in an image, based on answers from player B (Oracle). Based on this dialog history between the Questioner and the Oracle, a Guesser makes a final guess of the target object. The previous baseline Oracle model encodes no visual information, and it cannot fully understand complex questions about color, shape, relationships and so on. Most existing work on the Guesser encodes the dialog history as a whole and trains the Guesser models from scratch on the GuessWhat?! dataset. This is problematic since language encoders tend to forget long-term history and the GuessWhat?! data is sparse in terms of learning visual grounding of objects. Previous work on the Questioner introduces a state tracking mechanism into the model,…
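The game protocol described in the abstract can be sketched as a simple loop over three roles. The following is a minimal illustration only: the agents are toy rule-based stand-ins operating on symbolic object attributes, not the ViLBERT-based models from the paper, and every name here (`GameObject`, `play_game`, `toy_questioner`, `toy_oracle`, `toy_guesser`, the `attr=value` question format) is a hypothetical chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class GameObject:
    """Hypothetical symbolic stand-in for an object in the image."""
    category: str
    color: str


Dialog = List[Tuple[str, str]]  # (question, answer) pairs


def consistent(obj: GameObject, history: Dialog) -> bool:
    """True if obj agrees with every yes/no answer given so far."""
    for question, answer in history:
        attr, value = question.split("=")
        if (getattr(obj, attr) == value) != (answer == "yes"):
            return False
    return True


def toy_questioner(objects: List[GameObject], history: Dialog) -> Optional[str]:
    """Ask about an attribute that still splits the remaining candidates."""
    candidates = [o for o in objects if consistent(o, history)]
    asked = {q for q, _ in history}
    for attr in ("category", "color"):
        values = {getattr(o, attr) for o in candidates}
        if len(values) > 1:
            for value in sorted(values):
                question = f"{attr}={value}"
                if question not in asked:
                    return question
    return None  # nothing left to disambiguate


def toy_oracle(question: str, target: GameObject) -> str:
    """Answer yes/no by inspecting the target object only."""
    attr, value = question.split("=")
    return "yes" if getattr(target, attr) == value else "no"


def toy_guesser(objects: List[GameObject], history: Dialog) -> int:
    """Pick the first object consistent with the whole dialog history."""
    for index, obj in enumerate(objects):
        if consistent(obj, history):
            return index
    return 0


def play_game(
    objects: List[GameObject],
    target_index: int,
    questioner: Callable[[List[GameObject], Dialog], Optional[str]],
    oracle: Callable[[str, GameObject], str],
    guesser: Callable[[List[GameObject], Dialog], int],
    max_turns: int = 5,
) -> Tuple[int, Dialog]:
    """Run one game: the Questioner asks, the Oracle answers, the Guesser guesses."""
    history: Dialog = []
    for _ in range(max_turns):
        question = questioner(objects, history)
        if question is None:
            break
        history.append((question, oracle(question, objects[target_index])))
    return guesser(objects, history), history
```

Only the Oracle sees the target; the Questioner and Guesser work from the candidate objects and the accumulated dialog history, which is the division of roles the abstract describes.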
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
