Enhancing Visual Dialog State Tracking through Iterative Object-Entity Alignment in Multi-Round Conversations
Wei Pang, Ruixue Duan, Jinfu Yang, Ning Li

TL;DR
This paper introduces MDST, a novel framework for visual dialog state tracking that leverages round-level dialogue information to improve answer accuracy and consistency in multi-round image-based conversations.
Contribution
The paper proposes a multi-round dialogue state tracking model that captures round-specific dialogue states to enhance visual dialog understanding and answer generation.
Findings
MDST achieves state-of-the-art results on VisDial v1.0 in the generative setting.
Human studies confirm improved answer consistency and human-likeness.
MDST effectively grounds questions with vision-language representations.
Abstract
Visual Dialog (VD) is a task where an agent answers a series of image-related questions based on a multi-round dialog history. However, previous VD methods often treat the entire dialog history as a simple text input, disregarding the inherent conversational information flows at the round level. In this paper, we introduce the Multi-round Dialogue State Tracking model (MDST), a framework that addresses this limitation by leveraging the dialogue state learned from dialog history to answer questions. MDST captures each round of dialog history, constructing internal dialogue state representations defined as 2-tuples of vision-language representations. These representations effectively ground the current question, enabling the generation of accurate answers. Experimental results on the VisDial v1.0 dataset demonstrate that MDST achieves new state-of-the-art performance in the generative setting.…
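The round-level idea in the abstract can be illustrated with a toy sketch. This is not the authors' architecture: the abstract only states that the dialogue state is a 2-tuple of vision-language representations built round by round. Everything else here is an assumption made for illustration, including the names `DialogueState`, `attend`, `update_state`, and `track`, the fixed blending weight `mix`, and the use of plain dot-product attention over object vectors.

```python
import math
from typing import List, NamedTuple

Vector = List[float]


class DialogueState(NamedTuple):
    """Hypothetical 2-tuple state: one visual and one language vector."""
    vision: Vector
    language: Vector


def softmax(scores: List[float]) -> List[float]:
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, items: List[Vector]) -> Vector:
    # Dot-product attention: weight each item by its similarity to the query,
    # then return the weighted average of the items.
    scores = [sum(q * x for q, x in zip(query, item)) for item in items]
    weights = softmax(scores)
    dim = len(items[0])
    return [sum(w * item[d] for w, item in zip(weights, items)) for d in range(dim)]


def update_state(state: DialogueState, round_text: Vector,
                 objects: List[Vector], mix: float = 0.5) -> DialogueState:
    # Assumed update rule: blend the new round into the language part,
    # then re-attend over the image objects with the updated language part.
    language = [(1 - mix) * s + mix * r for s, r in zip(state.language, round_text)]
    vision = attend(language, objects)
    return DialogueState(vision=vision, language=language)


def track(history: List[Vector], objects: List[Vector]) -> List[DialogueState]:
    # Process the history one round at a time instead of as one flat text,
    # keeping the state produced after every round.
    dim = len(objects[0])
    state = DialogueState(vision=[0.0] * dim, language=[0.0] * dim)
    states = []
    for round_text in history:
        state = update_state(state, round_text, objects)
        states.append(state)
    return states
```

Calling `track(history, objects)` with one vector per dialog round and one vector per detected object returns one state per round; the final state is what a decoder would, under these assumptions, use to ground the current question.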
