Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat
Ravi Shekhar, Aashish Venkatesh, Tim Baumgärtner, Elia Bruni, Barbara Plank, Raffaella Bernardi, Raquel Fernández

TL;DR
This paper introduces a grounded dialogue state encoder for the GuessWhat?! object-guessing dialogue game. Trained with multi-task and cooperative learning, it improves accuracy over the baseline, and the accompanying analysis examines how models' linguistic skills differ beyond task success.
Contribution
It presents a grounded dialogue state encoder for GuessWhat?! that is trained jointly on guessing and question asking via multi-task learning and further enriched with cooperative learning. This improves accuracy over the baseline and supports an analysis of the models' linguistic skills.
Findings
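The joint architecture and cooperative learning both improve accuracy over the baseline system
The proposed model and the reinforcement learning alternative differ dramatically in their linguistic skills
The in-depth analysis highlights the importance of looking beyond task success metrics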
Abstract
We propose a grounded dialogue state encoder which addresses a foundational issue on how to integrate visual grounding with dialogue system components. As a test-bed, we focus on the GuessWhat?! game, a two-player game where the goal is to identify an object in a complex visual scene by asking a sequence of yes/no questions. Our visually-grounded encoder leverages synergies between guessing and asking questions, as it is trained jointly using multi-task learning. We further enrich our model via a cooperative learning regime. We show that the introduction of both the joint architecture and cooperative learning leads to accuracy improvements over the baseline system. We compare our approach to an alternative system which extends the baseline with reinforcement learning. Our in-depth analysis shows that the linguistic skills of the two models differ dramatically, despite approaching…
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Topic Modeling
