TL;DR
This paper introduces a multi-modal dialogue state tracking model for the GuessWhich game, enabling a Questioner Bot to perform visual reasoning through mental imagery, leading to state-of-the-art results on VisDial datasets.
Contribution
Proposes a dialogue state tracking approach for GuessWhich in which QBot reasons over a mental model of the undisclosed image, addressing existing methods that either lack visual information or rely on a single real image sampled at each round as decoding context.
Findings
Achieves new state-of-the-art performance on VisDial datasets.
Effectively models visually related reasoning through mental imagery.
Demonstrates robustness across multiple dataset versions.
Abstract
GuessWhich is an engaging visual dialogue game that involves interaction between a Questioner Bot (QBot) and an Answer Bot (ABot) in the context of image-guessing. In this game, QBot's objective is to locate a concealed image solely through a series of visually related questions posed to ABot. However, effectively modeling visually related reasoning in QBot's decision-making process poses a significant challenge. Current approaches either lack visual information or rely on a single real image sampled at each round as decoding context, both of which are inadequate for visual reasoning. To address this limitation, we propose a novel approach that focuses on visually related reasoning through the use of a mental model of the undisclosed image. Within this framework, QBot learns to represent mental imagery, enabling robust visual reasoning by tracking the dialogue state. The dialogue state…
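The game loop described in the abstract can be illustrated with a toy sketch. This is not the paper's model: the vector representation, the `update_state` interpolation rule, the cosine ranking, and all names and values below are assumptions made purely for illustration. It only shows the general shape of the interaction, where QBot keeps an internal state standing in for its mental image, updates it after each answer from ABot, and finally ranks candidate images against that state.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def update_state(state, answer_vec, rate=0.5):
    """Hypothetical state update: move the state toward the evidence in the latest answer."""
    return [(1 - rate) * s + rate * a for s, a in zip(state, answer_vec)]


def rank_candidates(state, candidates):
    """Order candidate image names from most to least similar to the current state."""
    return sorted(candidates, key=lambda name: cosine(state, candidates[name]), reverse=True)


def play(answers, candidates, dim):
    """Run one toy game: start from an empty state, absorb each answer, then rank images."""
    state = [0.0] * dim
    for answer_vec in answers:
        state = update_state(state, answer_vec)
    return rank_candidates(state, candidates)


# Made-up candidate pool; each image is a 3-dimensional feature vector.
candidates = {
    "beach": [1.0, 0.0, 0.0],
    "kitchen": [0.0, 1.0, 0.0],
    "street": [0.0, 0.0, 1.0],
}

# Made-up encodings of ABot's answers over two rounds.
answers = [[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]]

ranking = play(answers, candidates, dim=3)
print(ranking)
```

In this toy setup the state accumulates evidence across rounds instead of depending on a single sampled image, which is the intuition the abstract points to. The actual state representation and update mechanism are defined in the paper, not here.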
