MMWOZ: Building Multimodal Agent for Task-oriented Dialogue
Pu-Hai Yang, Heyan Huang, Heng-Da Xu, Fanshu Sun, Xian-Ling Mao, Chaoxu Mu

TL;DR
This paper introduces MMWOZ, a multimodal dataset with GUI snapshots for task-oriented dialogue, and proposes MATE, a model designed to operate in real-world scenarios lacking backend APIs.
Contribution
It creates a new multimodal dataset with GUI interactions and develops MATE, a baseline model for practical task-oriented dialogue systems in GUI-based environments.
Findings
The MMWOZ dataset extends MultiWOZ 2.3 with a web-style GUI and GUI snapshots.
MATE effectively utilizes multimodal data for dialogue tasks.
Experimental results demonstrate MATE's potential in real-world applications.
Abstract
Task-oriented dialogue systems have garnered significant attention due to their conversational ability to accomplish goals, such as booking airline tickets for users. Traditionally, task-oriented dialogue systems are conceptualized as intelligent agents that interact with users using natural language and have access to customized back-end APIs. However, in real-world scenarios, the widespread presence of front-end Graphical User Interfaces (GUIs) and the absence of customized back-end APIs create a significant gap for traditional task-oriented dialogue systems in practical applications. In this paper, to bridge the gap, we collect MMWOZ, a new multimodal dialogue dataset that extends the MultiWOZ 2.3 dataset. Specifically, we begin by developing a web-style GUI to serve as the front-end. Next, we devise an automated script to convert the dialogue states and system actions from the…
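As a rough illustration of the conversion step the abstract mentions, the sketch below turns a dialogue state into a sequence of GUI operations. The abstract is cut off before it names the script's output, so the operation format, the `GuiOp` type, the `state_to_ops` function, and the element identifiers are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuiOp:
    """One GUI operation (hypothetical format, not the paper's)."""
    action: str       # "click" or "type" (assumed action vocabulary)
    target: str       # assumed element identifier on the web-style GUI
    value: str = ""   # text to enter; empty for clicks


def state_to_ops(domain: str, state: dict[str, str]) -> list[GuiOp]:
    """Map a dialogue state (slot -> value) to GUI operations.

    Assumed page layout: one tab per domain, one input field per slot,
    and one search button per domain.
    """
    ops = [GuiOp("click", f"tab-{domain}")]
    # Sort slots so the generated sequence is deterministic.
    for slot, value in sorted(state.items()):
        ops.append(GuiOp("type", f"{domain}-{slot}", value))
    ops.append(GuiOp("click", f"{domain}-search"))
    return ops


if __name__ == "__main__":
    for op in state_to_ops("hotel", {"area": "north", "stars": "4"}):
        print(op)
```

A design like this keeps the dialogue annotations as the single source of truth and derives GUI operations from them mechanically, which is one plausible reason to automate the conversion rather than annotate GUI traces by hand.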
Taxonomy
Topics: Speech and dialogue systems · Topic Modeling · AI in Service Interactions
