Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning
Abhishek Das, Satwik Kottur, José M. F. Moura, Stefan Lee, Dhruv Batra

TL;DR
This paper presents a goal-driven approach to training visual question answering and dialog agents using deep reinforcement learning, demonstrating emergent communication and improved performance on real-image datasets.
Contribution
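It introduces a cooperative two-agent framework (Qbot and Abot) trained end-to-end with RL for visual dialog, shows that a communication protocol emerges without language supervision in a synthetic world, and reports that RL fine-tuning outperforms supervised learning on real-image datasets.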
Findings
Agents develop their own communication protocol in synthetic environments.
RL fine-tuning outperforms supervised learning on real-image datasets.
Agents learn to ask more informative questions, improving team performance.
Abstract
We introduce the first goal-driven training for visual question answering and dialog agents. Specifically, we pose a cooperative 'image guessing' game between two agents -- Qbot and Abot -- who communicate in natural language dialog so that Qbot can select an unseen image from a lineup of images. We use deep reinforcement learning (RL) to learn the policies of these agents end-to-end -- from pixels to multi-agent multi-round dialog to game reward. We demonstrate two experimental results. First, as a 'sanity check' demonstration of pure RL (from scratch), we show results on a synthetic world, where the agents communicate in ungrounded vocabulary, i.e., symbols with no pre-specified meanings (X, Y, Z). We find that two bots invent their own communication protocol and start using certain symbols to ask/answer about certain visual attributes (shape/color/style). Thus, we demonstrate the…
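The synthetic-world game described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. It assumes a single-round game in which Qbot must guess one attribute, tabular softmax policies in place of neural networks, a +1/-1 reward, four values per attribute, and an answer vocabulary of "1" to "4". All class and function names (TabularPolicy, play_episode, train) are hypothetical.

```python
import math
import random

ATTRS = ("shape", "color", "style")  # attribute names from the abstract
N_VALUES = 4                         # assumption: four values per attribute
Q_VOCAB = ("X", "Y", "Z")            # ungrounded question symbols
A_VOCAB = ("1", "2", "3", "4")       # assumption: ungrounded answer symbols


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class TabularPolicy:
    """Softmax policy holding one logit vector per discrete state."""

    def __init__(self, n_actions):
        self.n_actions = n_actions
        self.logits = {}

    def probs(self, state):
        row = self.logits.setdefault(state, [0.0] * self.n_actions)
        return softmax(row)

    def sample(self, state, rng):
        return rng.choices(range(self.n_actions), weights=self.probs(state))[0]

    def reinforce(self, state, action, reward, lr):
        # REINFORCE step: d log pi(a|s) / d logits = onehot(a) - probs
        p = self.probs(state)
        row = self.logits[state]
        for i in range(self.n_actions):
            indicator = 1.0 if i == action else 0.0
            row[i] += lr * reward * (indicator - p[i])


def play_episode(ask, answer, guess, rng, lr=0.1, learn=True):
    """One cooperative round; both agents share the same game reward."""
    image = tuple(rng.randrange(N_VALUES) for _ in ATTRS)  # seen only by Abot
    task = rng.randrange(len(ATTRS))   # attribute Qbot must identify
    q = ask.sample(task, rng)          # Qbot emits a question symbol
    a = answer.sample((image, q), rng) # Abot answers from image + question
    g = guess.sample((task, a), rng)   # Qbot guesses the attribute value
    reward = 1.0 if g == image[task] else -1.0  # assumption: +1 / -1 reward
    if learn:
        ask.reinforce(task, q, reward, lr)
        answer.reinforce((image, q), a, reward, lr)
        guess.reinforce((task, a), g, reward, lr)
    return reward


def train(episodes=20000, seed=0):
    rng = random.Random(seed)
    ask = TabularPolicy(len(Q_VOCAB))
    answer = TabularPolicy(len(A_VOCAB))
    guess = TabularPolicy(N_VALUES)
    for _ in range(episodes):
        play_episode(ask, answer, guess, rng)
    return ask, answer, guess
```

Calling train() and then inspecting ask.probs(task) for each task index shows which symbol, if any, Qbot has come to associate with each attribute. Whether a clean protocol emerges in this toy version depends on the learning rate and the number of episodes.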
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
