TL;DR
This paper introduces ADVSE, a novel approach that enhances goal-oriented visual dialogue by effectively incorporating answer effects into visual state estimation, leading to improved question generation and guessing accuracy.
Contribution
The paper proposes the Answer-Driven Visual State Estimator (ADVSE), which uses answer-driven attention and conditional information fusion to better utilize answers in visual dialogue systems.
Findings
Achieves state-of-the-art results on GuessWhat?! dataset.
Improves question efficiency and visual attention reliability.
Enhances visual dialogue agent performance.
Abstract
A goal-oriented visual dialogue involves multi-turn interactions between two agents, Questioner and Oracle. During which, the answer given by Oracle is of great significance, as it provides golden response to what Questioner concerns. Based on the answer, Questioner updates its belief on target visual content and further raises another question. Notably, different answers drive into different visual beliefs and future questions. However, existing methods always indiscriminately encode answers after much longer questions, resulting in a weak utilization of answers. In this paper, we propose an Answer-Driven Visual State Estimator (ADVSE) to impose the effects of different answers on visual states. First, we propose an Answer-Driven Focusing Attention (ADFA) to capture the answer-driven effect on visual attention by sharpening question-related attention and adjusting it by answer-based…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
