CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog

Satwik Kottur; José M. F. Moura; Devi Parikh; Dhruv Batra; Marcus Rohrbach

arXiv:1903.03166 · cs.CV · September 20, 2019 · 50 cites

TL;DR

CLEVR-Dialog is a comprehensive, fully-annotated diagnostic dataset designed to evaluate multi-round reasoning in visual dialog, enabling detailed analysis of model capabilities in vision, language, and grounding tasks.

Contribution

We introduce CLEVR-Dialog, a fully-annotated diagnostic dataset for multi-round visual reasoning, facilitating detailed analysis of visual dialog models' performance.

Findings

01

Benchmarking reveals challenges in coreference resolution over dialog distance.

02

The dataset enables analysis of reasoning capabilities in visual dialog models.

03

Performance varies significantly with coreference distance.

Abstract

Visual Dialog is a multimodal task of answering a sequence of questions grounded in an image, using the conversation history as context. It entails challenges in vision, language, reasoning, and grounding. However, studying these subtasks in isolation on large, real datasets is infeasible as it requires prohibitively-expensive complete annotation of the 'state' of all images and dialogs. We develop CLEVR-Dialog, a large diagnostic dataset for studying multi-round reasoning in visual dialog. Specifically, we construct a dialog grammar that is grounded in the scene graphs of the images from the CLEVR dataset. This combination results in a dataset where all aspects of the visual dialog are fully annotated. In total, CLEVR-Dialog contains 5 instances of 10-round dialogs for about 85k CLEVR images, totaling to 4.25M question-answer pairs. We use CLEVR-Dialog to benchmark performance of…
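
The dataset size quoted in the abstract follows from simple multiplication. The snippet below is a minimal sanity check of that arithmetic, not code from the paper or its repository; the variable names are illustrative, and because the abstract says "about 85k" images, the totals are approximate.

```python
# Figures quoted in the abstract; variable names are illustrative assumptions.
dialogs_per_image = 5      # "5 instances of 10-round dialogs"
rounds_per_dialog = 10     # one question-answer pair per round
num_images = 85_000        # "about 85k CLEVR images" (approximate)

total_dialogs = dialogs_per_image * num_images
total_qa_pairs = total_dialogs * rounds_per_dialog

print(f"{total_dialogs:,} dialogs, {total_qa_pairs:,} question-answer pairs")
```

With these rounded inputs the check gives 425,000 dialogs and 4,250,000 question-answer pairs, matching the 4.25M figure in the abstract.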

Code & Models

Repositories

satwikkottur/clevr-dialog (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Speech and dialogue systems