GuessWhat?! Visual object discovery through multi-modal dialogue
Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, Aaron Courville

TL;DR
GuessWhat?! is a two-player guessing game, with an accompanying dataset, in which one player locates an unknown object in an image by asking a sequence of questions. It serves as a testbed for research that combines computer vision and dialogue systems, in particular spatial reasoning and language grounding.
Contribution
The paper contributes a large-scale dataset of 150K human-played games with 800K visual question-answer pairs on 66K images, defines the oracle and questioner tasks for the two players of the game, and provides prototype deep learning models as initial baselines.
Findings
Dataset of 150K games and 800K question-answer pairs created
Baseline deep learning models established for the oracle and questioner tasks
The task requires higher-level image understanding, such as spatial reasoning and language grounding
Abstract
We introduce GuessWhat?!, a two-player guessing game as a testbed for research on the interplay of computer vision and dialogue systems. The goal of the game is to locate an unknown object in a rich image scene by asking a sequence of questions. Higher-level image understanding, like spatial reasoning and language grounding, is required to solve the proposed task. Our key contribution is the collection of a large-scale dataset consisting of 150K human-played games with a total of 800K visual question-answer pairs on 66K images. We explain our design decisions in collecting the dataset and introduce the oracle and questioner tasks that are associated with the two players of the game. We prototyped deep learning models to establish initial baselines of the introduced tasks.
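The game loop described in the abstract can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the authors' models or data format: the names `SceneObject`, `oracle_answer`, and `play_game` are hypothetical, objects are reduced to hand-written attribute sets instead of image regions, questions are single attribute lookups instead of natural language, and the "Yes"/"No" answer vocabulary is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """One candidate object in a scene (hypothetical toy representation)."""
    name: str
    attributes: frozenset


def oracle_answer(target: SceneObject, attribute: str) -> str:
    """Oracle role: knows the target and answers a question about one attribute."""
    return "Yes" if attribute in target.attributes else "No"


def play_game(objects, target, questions):
    """Questioner role: ask questions in order, keeping only consistent candidates.

    Returns the list of (question, answer) pairs and the remaining candidates.
    """
    candidates = list(objects)
    dialogue = []
    for attribute in questions:
        answer = oracle_answer(target, attribute)
        dialogue.append((attribute, answer))
        keep_if_has = answer == "Yes"
        # Drop every candidate that contradicts the oracle's answer.
        candidates = [o for o in candidates if (attribute in o.attributes) == keep_if_has]
        if len(candidates) <= 1:
            break  # the questioner is ready to guess
    return dialogue, candidates


scene = [
    SceneObject("dog", frozenset({"animal", "left"})),
    SceneObject("cat", frozenset({"animal", "right"})),
    SceneObject("car", frozenset({"vehicle", "right"})),
]
dialogue, remaining = play_game(scene, target=scene[1], questions=["animal", "left", "vehicle"])
print(dialogue)
print([o.name for o in remaining])
```

In this toy run the questioner stops after two questions, because the answers to "animal" and "left" leave a single candidate. The two roles map onto the oracle and questioner tasks named in the abstract, where the oracle must ground each question in the image and the questioner must choose which question to ask next.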
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
