Multimodal Dialogue State Tracking By QA Approach with Data Augmentation
Xiangyang Mou, Brandyn Sigouin, Ian Steenstra, Hui Su

TL;DR
This paper introduces a multimodal QA-based approach with data augmentation for dialogue state tracking in video-grounded conversations, improving performance on the Audio-Video Scene-Aware Dialogue (AVSD) task.
Contribution
It presents a novel QA framework for AVSD, incorporating data augmentation techniques to enhance dialogue understanding and state tracking accuracy.
Findings
Significant performance improvements over baseline models.
Effective data augmentation enhances model robustness.
Demonstrates the potential of the QA approach in multimodal dialogue tasks.
Abstract
Recently, a more challenging state tracking task, Audio-Video Scene-Aware Dialogue (AVSD), has been attracting increasing attention from researchers. Unlike purely text-based dialogue state tracking, the dialogue in AVSD consists of a sequence of question-answer pairs about a video, and producing the final answer to a given question requires additional understanding of the video. This paper interprets the AVSD task from an open-domain Question Answering (QA) point of view and proposes a multimodal open-domain QA system to address it. The proposed QA system uses a common encoder-decoder framework with multimodal fusion and attention. Teacher forcing is applied to train a natural language generator. We also propose a new data augmentation approach designed specifically under the QA assumption. Our experiments show that our model and techniques bring significant improvements over the…
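The abstract mentions teacher forcing without spelling it out. As a rough illustration only, and not the authors' implementation, the sketch below shows how a reference answer is typically split into decoder inputs and prediction targets under teacher forcing. The function name `teacher_forcing_pairs` and the `<bos>`/`<eos>` tokens are assumptions made for this example; the paper's actual tokenization and decoder are not described here.

```python
# Hypothetical special tokens; the paper's real vocabulary is not given here.
BOS, EOS = "<bos>", "<eos>"


def teacher_forcing_pairs(answer_tokens):
    """Return (decoder_inputs, targets) for one reference answer.

    Under teacher forcing, the decoder at step t is fed the ground-truth
    token from step t-1 (rather than its own previous prediction) and is
    trained to predict the ground-truth token at step t.
    """
    decoder_inputs = [BOS] + list(answer_tokens)  # reference shifted right
    targets = list(answer_tokens) + [EOS]         # what the decoder must emit
    return decoder_inputs, targets


if __name__ == "__main__":
    inputs, targets = teacher_forcing_pairs(["a", "man", "is", "cooking"])
    for step, (fed, expected) in enumerate(zip(inputs, targets)):
        print(step, fed, "->", expected)
```

At inference time no reference answer is available, so the decoder would instead consume its own previous output at each step.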
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Human Pose and Action Recognition
