SpokenUS: A Spoken User Simulator for Task-Oriented Dialogue

Jonggeun Lee; Junseong Pyo; Jeongmin Park; Yohan Jo

arXiv:2603.16783 · cs.CL · March 18, 2026

Open Access

TL;DR

SpokenUS is a spoken user simulator for task-oriented dialogue, trained on a large, diverse dataset augmented with realistic spoken user behaviors and intended for stress-testing and evaluating dialogue systems.

Contribution

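The paper presents SpokenTOD, a spoken task-oriented dialogue dataset of 52,390 dialogues and 1,034 hours of speech augmented with four spoken user behaviors (cross-turn slots, barge-in, disfluency, and emotional prosody), and SpokenUS, a user simulator built on this data with a dedicated architecture for barge-in.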

Findings

01. SpokenUS achieves goal coverage comparable to larger models.

02. SpokenUS outperforms baselines in human MOS evaluations.

03. SpokenUS's behaviors challenge downstream dialogue agents effectively.

Abstract

Robust task-oriented spoken dialogue agents require exposure to the full diversity of how people interact through speech. Building spoken user simulators that address this requires large-scale spoken task-oriented dialogue (TOD) data encompassing spoken user behaviors, yet existing datasets are limited in scale and domain coverage, with no systematic pipeline for augmenting them. To address this, we introduce SpokenTOD, a spoken TOD dataset of 52,390 dialogues and 1,034 hours of speech augmented with four spoken user behaviors (cross-turn slots, barge-in, disfluency, and emotional prosody) across diverse speakers and domains. Building on SpokenTOD, we present SpokenUS, a spoken user simulator grounded in TOD with a dedicated architecture for barge-in. SpokenUS achieves comparable goal coverage to significantly larger models while substantially outperforming all…

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Multimodal Machine Learning Applications