Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning

Abhishek Das; Satwik Kottur; José M. F. Moura; Stefan Lee; Dhruv Batra

arXiv:1703.06585·cs.CV·March 22, 2017·92 cites

TL;DR

This paper presents a goal-driven approach to training visual question answering and dialog agents using deep reinforcement learning, demonstrating emergent communication and improved performance on real-image datasets.

Contribution

It introduces a cooperative two-agent framework (Qbot and Abot) for visual dialog, trained end-to-end with RL. In a synthetic world, the agents develop a communication protocol without language supervision, and the approach improves performance on real-image datasets.

Findings

01. Agents develop their own communication protocol in synthetic environments.

02. RL fine-tuning outperforms supervised learning on real-image datasets.

03. Agents learn to ask more informative questions, improving team performance.

Abstract

We introduce the first goal-driven training for visual question answering and dialog agents. Specifically, we pose a cooperative 'image guessing' game between two agents -- Qbot and Abot -- who communicate in natural language dialog so that Qbot can select an unseen image from a lineup of images. We use deep reinforcement learning (RL) to learn the policies of these agents end-to-end -- from pixels to multi-agent multi-round dialog to game reward. We demonstrate two experimental results. First, as a 'sanity check' demonstration of pure RL (from scratch), we show results on a synthetic world, where the agents communicate in ungrounded vocabulary, i.e., symbols with no pre-specified meanings (X, Y, Z). We find that two bots invent their own communication protocol and start using certain symbols to ask/answer about certain visual attributes (shape/color/style). Thus, we demonstrate the…
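The synthetic-world game described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation, which learns neural policies from pixels. It assumes tabular softmax policies trained with REINFORCE, a single question-answer round, four values per attribute, an answer vocabulary of four symbols, and a reward of +1 or -1. All names (`TabularPolicy`, `play_episode`, `train`) are hypothetical.

```python
import math
import random
from collections import defaultdict

# Assumed toy world: three attributes (named in the abstract), four values each (assumption).
ATTRIBUTES = ("shape", "color", "style")
NUM_VALUES = 4
Q_VOCAB = ("X", "Y", "Z")        # ungrounded question symbols, as in the abstract
A_VOCAB = ("1", "2", "3", "4")   # ungrounded answer symbols (assumption)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


class TabularPolicy:
    """Softmax policy with one logit vector per discrete state (hypothetical helper)."""

    def __init__(self, num_actions):
        self.num_actions = num_actions
        self.logits = defaultdict(lambda: [0.0] * num_actions)

    def probs(self, state):
        return softmax(self.logits[state])

    def sample(self, state, rng):
        return rng.choices(range(self.num_actions), weights=self.probs(state))[0]

    def reinforce(self, state, action, reward, lr):
        # d/dlogits of log pi(action | state) is onehot(action) - probs.
        p = self.probs(state)
        for a in range(self.num_actions):
            grad = (1.0 if a == action else 0.0) - p[a]
            self.logits[state][a] += lr * reward * grad


def play_episode(q_ask, a_answer, q_guess, rng, lr=0.0):
    """One single-round game; updates all three policies when lr > 0."""
    obj = tuple(rng.randrange(NUM_VALUES) for _ in ATTRIBUTES)  # seen only by Abot
    task = rng.randrange(len(ATTRIBUTES))                       # known only to Qbot
    question = q_ask.sample(task, rng)               # Qbot emits a question symbol
    answer = a_answer.sample((obj, question), rng)   # Abot replies with an answer symbol
    guess = q_guess.sample((task, answer), rng)      # Qbot guesses the attribute value
    reward = 1.0 if guess == obj[task] else -1.0     # shared, cooperative reward
    if lr > 0.0:
        q_ask.reinforce(task, question, reward, lr)
        a_answer.reinforce((obj, question), answer, reward, lr)
        q_guess.reinforce((task, answer), guess, reward, lr)
    return reward


def train(num_episodes=20000, lr=0.1, seed=0):
    rng = random.Random(seed)
    q_ask = TabularPolicy(len(Q_VOCAB))
    a_answer = TabularPolicy(len(A_VOCAB))
    q_guess = TabularPolicy(NUM_VALUES)
    for _ in range(num_episodes):
        play_episode(q_ask, a_answer, q_guess, rng, lr)
    return q_ask, a_answer, q_guess
```

Because both agents receive the same reward, any mapping from symbols to attributes that lets Qbot guess correctly is reinforced, so the symbols can only acquire meaning through play. Whether this toy version converges to a consistent protocol depends on the learning rate and the number of episodes.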

Code & Models

Videos

Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling