Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation

Tao Tu; Qing Ping; Govind Thattai; Gokhan Tur; Prem Natarajan

arXiv:2105.11541 · cs.CV · May 26, 2021

TL;DR

This paper builds visual dialog agents on a pretrained vision-linguistic model, ViLBERT, to improve how they understand and reason about objects in an image during the GuessWhat?! guessing game, leading to significant performance gains.

Contribution

It introduces ViLBERT-based models for the Oracle, Guesser, and Questioner, with novel fusion mechanisms and a unified framework built on pretrained vision-linguistic representations.

Findings

01 Outperforms state-of-the-art models by 7-12% across tasks.

02 Improves visual grounding and question understanding.

03 Enhances long-term dialog comprehension.

Abstract

GuessWhat?! is a two-player visual dialog guessing game where player A asks a sequence of yes/no questions (Questioner) and makes a final guess (Guesser) about a target object in an image, based on answers from player B (Oracle). Based on this dialog history between the Questioner and the Oracle, a Guesser makes a final guess of the target object. The previous baseline Oracle model encodes no visual information, so it cannot fully understand complex questions about color, shape, relationships and so on. Most existing work on the Guesser encodes the dialog history as a whole and trains the Guesser models from scratch on the GuessWhat?! dataset. This is problematic since language encoders tend to forget long-term history and the GuessWhat?! data is sparse in terms of learning visual grounding of objects. Previous work on the Questioner introduces a state tracking mechanism into the model,…
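The turn structure described in the abstract can be illustrated with a toy sketch. This is not the paper's ViLBERT-based method: the `SceneObject` type, the attribute sets, and the keyword-style "questions" are all hypothetical stand-ins, chosen only to show how the Questioner, Oracle, and Guesser roles exchange information through the dialog history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical stand-in for an object in the image."""
    name: str
    attributes: frozenset


def oracle(target, question):
    """Player B: answer a yes/no question about the hidden target."""
    return "yes" if question in target.attributes else "no"


def guesser(objects, history):
    """Player A's final step: return the first object consistent with every answer."""
    for obj in objects:
        if all((q in obj.attributes) == (a == "yes") for q, a in history):
            return obj
    return None


def play(objects, target, questions):
    """Run one game: ask each question in turn, then guess from the full history."""
    history = []
    for q in questions:  # a scripted Questioner, assumed for illustration
        history.append((q, oracle(target, q)))
    return guesser(objects, history), history


objects = [
    SceneObject("cat", frozenset({"animal", "left"})),
    SceneObject("dog", frozenset({"animal", "right"})),
    SceneObject("chair", frozenset({"furniture", "right"})),
]
guess, history = play(objects, objects[1], ["animal", "left"])
print(guess.name, history)
```

In the actual task the questions are free-form natural language and the Oracle and Guesser must ground them in image regions, which is where the abstract locates the weaknesses of earlier models.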

Code & Models

Repositories

amazon-research/read-up (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques