GuessWhat?! Visual object discovery through multi-modal dialogue

Harm de Vries; Florian Strub; Sarath Chandar; Olivier Pietquin; Hugo Larochelle; Aaron Courville

arXiv:1611.08481 · cs.AI · February 8, 2017 · 19 cites

TL;DR

GuessWhat?! is a two-player guessing game, with an accompanying dataset, in which a questioner locates an unknown object in an image by asking a sequence of questions. It serves as a testbed for research that combines computer vision and dialogue systems, and solving it requires spatial reasoning and language grounding.

Contribution

The paper contributes a large-scale dataset of 150K human-played games with 800K visual question-answer pairs on 66K images, defines the oracle and questioner tasks for the two players of the game, and prototypes deep learning models as initial baselines.

Findings

01 Dataset of 150K human-played games and 800K question-answer pairs on 66K images

02 Baseline deep learning models established for the oracle and questioner tasks

03 Solving the task requires higher-level image understanding, such as spatial reasoning and language grounding

Abstract

We introduce GuessWhat?!, a two-player guessing game as a testbed for research on the interplay of computer vision and dialogue systems. The goal of the game is to locate an unknown object in a rich image scene by asking a sequence of questions. Higher-level image understanding, like spatial reasoning and language grounding, is required to solve the proposed task. Our key contribution is the collection of a large-scale dataset consisting of 150K human-played games with a total of 800K visual question-answer pairs on 66K images. We explain our design decisions in collecting the dataset and introduce the oracle and questioner tasks that are associated with the two players of the game. We prototyped deep learning models to establish initial baselines of the introduced tasks.
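
The game loop described in the abstract can be illustrated with a toy sketch. This is not the authors' code or their baseline models. It assumes that answers are restricted to "yes" or "no" (the abstract does not state the answer format), that each question can be modeled as a predicate over a hand-written object record, and that the questioner simply filters candidates. The names `SceneObject`, `Game`, `oracle_answer`, and `play` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SceneObject:
    # Hypothetical object record; the field names are illustrative only.
    object_id: int
    category: str
    position: str  # coarse location, e.g. "left" or "right"


@dataclass
class Game:
    objects: List[SceneObject]
    target_id: int  # known to the oracle, hidden from the questioner
    dialogue: List[Tuple[str, str]] = field(default_factory=list)


Question = Tuple[str, Callable[[SceneObject], bool]]


def oracle_answer(game: Game, predicate: Callable[[SceneObject], bool]) -> str:
    """Oracle role: answer a question about the hidden target object."""
    target = next(o for o in game.objects if o.object_id == game.target_id)
    return "yes" if predicate(target) else "no"


def play(game: Game, questions: List[Question]) -> int:
    """Questioner role: ask questions, filter candidates, then guess."""
    candidates = list(game.objects)
    for text, predicate in questions:
        answer = oracle_answer(game, predicate)
        game.dialogue.append((text, answer))
        keep = answer == "yes"
        # Keep only the objects that are consistent with the oracle's answer.
        candidates = [o for o in candidates if predicate(o) == keep]
        if len(candidates) == 1:
            break
    # The target always survives filtering, so candidates is never empty.
    return candidates[0].object_id


scene = [
    SceneObject(0, "person", "left"),
    SceneObject(1, "person", "right"),
    SceneObject(2, "dog", "right"),
]
game = Game(objects=scene, target_id=1)
questions = [
    ("Is it a person?", lambda o: o.category == "person"),
    ("Is it on the left?", lambda o: o.position == "left"),
]
guess = play(game, questions)  # guess == 1 after two questions
```

In the actual task both roles work from image content and natural language rather than hand-written predicates; the sketch only shows how a sequence of answers narrows the set of candidate objects.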

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning