Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat

Ravi Shekhar; Aashish Venkatesh; Tim Baumgärtner; Elia Bruni; Barbara Plank; Raffaella Bernardi; Raquel Fernández

arXiv:1809.03408 · cs.CL · March 18, 2019 · 5 cites

TL;DR

This paper introduces a grounded dialogue state encoder for the GuessWhat?! visual dialogue game. Trained with multi-task and cooperative learning, it improves accuracy over the baseline system, and the accompanying analysis examines linguistic differences between models that task success alone does not reveal.

Contribution

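It presents a grounded dialogue state encoder that integrates visual grounding with the question-asking and guessing components of a GuessWhat?! player. The encoder is trained with multi-task and cooperative learning, improves accuracy over the baseline, and supports an analysis of the models' linguistic skills.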

Findings

01

Joint architecture and cooperative learning both improve accuracy over the baseline system

02

The linguistic skills of the proposed model and the reinforcement-learning alternative differ dramatically

03

Analysis highlights the importance of looking beyond task success metrics

Abstract

We propose a grounded dialogue state encoder which addresses a foundational issue of how to integrate visual grounding with dialogue system components. As a test-bed, we focus on the GuessWhat?! game, a two-player game where the goal is to identify an object in a complex visual scene by asking a sequence of yes/no questions. Our visually-grounded encoder leverages synergies between guessing and asking questions, as it is trained jointly using multi-task learning. We further enrich our model via a cooperative learning regime. We show that the introduction of both the joint architecture and cooperative learning leads to accuracy improvements over the baseline system. We compare our approach to an alternative system which extends the baseline with reinforcement learning. Our in-depth analysis shows that the linguistic skills of the two models differ dramatically, despite approaching…
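The joint training idea in the abstract, one shared encoder feeding both a guessing task and a question-asking task, can be illustrated with a toy multi-task loss. This is only a sketch under stated assumptions, not the authors' model: the abstract gives no architectural detail, so the single tanh layer, the linear heads, the `qgen_weight` mixing factor, and every name below are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores: Sequence[float], target: int) -> float:
    """Negative log-probability assigned to the target index."""
    return -math.log(softmax(scores)[target])


def linear(weights: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    """Matrix-vector product: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def encode_state(visual: Sequence[float], dialogue: Sequence[float],
                 encoder_w: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical shared encoder: a tanh layer over concatenated
    visual and dialogue features (an assumption, not the paper's design)."""
    return [math.tanh(v) for v in linear(encoder_w, list(visual) + list(dialogue))]


def joint_loss(state: Sequence[float],
               guesser_w: Sequence[Sequence[float]],
               qgen_w: Sequence[Sequence[float]],
               target_object: int,
               next_token: int,
               qgen_weight: float = 1.0) -> float:
    """Multi-task loss: both heads read the same state, so gradients
    from guessing and from question generation would both reach the
    shared encoder. qgen_weight is a hypothetical mixing factor."""
    guess_loss = cross_entropy(linear(guesser_w, state), target_object)
    qgen_loss = cross_entropy(linear(qgen_w, state), next_token)
    return guess_loss + qgen_weight * qgen_loss


# Toy usage: 2 visual features, 1 dialogue feature, 3 candidate objects,
# and a 4-token vocabulary. All weights are zero, so both heads are uniform.
visual = [0.2, -0.1]
dialogue = [0.5]
encoder_w = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
guesser_w = [[0.0, 0.0] for _ in range(3)]
qgen_w = [[0.0, 0.0] for _ in range(4)]

state = encode_state(visual, dialogue, encoder_w)
loss = joint_loss(state, guesser_w, qgen_w, target_object=1, next_token=2)
```

With all-zero weights both heads predict uniformly, so `loss` equals log(3) + log(4). The sketch covers only the multi-task objective; the cooperative learning regime and the reinforcement-learning comparison mentioned in the abstract are not modelled here.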

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Topic Modeling