MMWOZ: Building Multimodal Agent for Task-oriented Dialogue

Pu-Hai Yang; Heyan Huang; Heng-Da Xu; Fanshu Sun; Xian-Ling Mao; Chaoxu Mu

arXiv:2511.12586·cs.CL·November 18, 2025

Open Access

TL;DR

This paper introduces MMWOZ, a multimodal dataset with GUI snapshots for task-oriented dialogue, and proposes MATE, a model designed to operate in real-world scenarios lacking backend APIs.

Contribution

The paper contributes a new multimodal dataset with GUI interactions, along with MATE, a baseline model for practical task-oriented dialogue systems in GUI-based environments.

Findings

01. The MMWOZ dataset extends MultiWOZ 2.3 with a web-style GUI and GUI snapshots.

02. MATE makes effective use of multimodal data for dialogue tasks.

03. Experimental results demonstrate MATE's potential in real-world applications.

Abstract

Task-oriented dialogue systems have garnered significant attention due to their conversational ability to accomplish goals, such as booking airline tickets for users. Traditionally, task-oriented dialogue systems are conceptualized as intelligent agents that interact with users using natural language and have access to customized back-end APIs. However, in real-world scenarios, the widespread presence of front-end Graphical User Interfaces (GUIs) and the absence of customized back-end APIs create a significant gap for traditional task-oriented dialogue systems in practical applications. In this paper, to bridge this gap, we collect MMWOZ, a new multimodal dialogue dataset extended from the MultiWOZ 2.3 dataset. Specifically, we begin by developing a web-style GUI to serve as the front-end. Next, we devise an automated script to convert the dialogue states and system actions from the…

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · AI in Service Interactions