SpokenUS: A Spoken User Simulator for Task-Oriented Dialogue
Jonggeun Lee, Junseong Pyo, Jeongmin Park, Yohan Jo

TL;DR
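SpokenUS is a spoken user simulator trained on a large, diverse spoken dialogue dataset with realistic speech behaviors (cross-turn slots, barge-in, disfluency, and emotional prosody), intended to make task-oriented dialogue agents more robust and their evaluation more realistic.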
Contribution
The paper presents SpokenTOD, a spoken task-oriented dialogue dataset of 52,390 dialogues and 1,034 hours of speech augmented with four spoken user behaviors, and SpokenUS, a user simulator built on this data with a dedicated architecture for barge-in.
Findings
SpokenUS achieves goal coverage comparable to significantly larger models.
SpokenUS outperforms baselines in human mean opinion score (MOS) evaluations.
SpokenUS's behaviors challenge downstream dialogue agents effectively.
Abstract
Robust task-oriented spoken dialogue agents require exposure to the full diversity of how people interact through speech. Building spoken user simulators that address this requires large-scale spoken task-oriented dialogue (TOD) data encompassing spoken user behaviors, yet existing datasets are limited in scale and domain coverage, with no systematic pipeline for augmenting them. To address this, we introduce SpokenTOD, a spoken TOD dataset of 52,390 dialogues and 1,034 hours of speech augmented with four spoken user behaviors (cross-turn slots, barge-in, disfluency, and emotional prosody) across diverse speakers and domains. Building on SpokenTOD, we present SpokenUS, a spoken user simulator grounded in TOD with a dedicated architecture for barge-in. SpokenUS achieves comparable goal coverage to significantly larger models while substantially outperforming all…
Taxonomy
Topics: Speech and dialogue systems · Topic Modeling · Multimodal Machine Learning Applications
