Spoken Conversational Agents with Large Language Models
Chao-Han Huck Yang, Andreas Stolcke, Larry Heck

TL;DR
This paper reviews the development of spoken conversational agents powered by large language models, discussing system architectures, adaptation techniques, datasets, and open challenges in privacy and safety.
Contribution
It provides a comprehensive overview of system designs, adaptation methods, and open problems for integrating large language models into spoken conversational agents.
Findings
Comparison of cascaded vs. end-to-end systems
Analysis of robustness across accents
Reproducible baselines and practical recipes
Abstract
Spoken conversational agents are converging toward voice-native LLMs. This tutorial distills the path from cascaded ASR/NLU to end-to-end, retrieval- and vision-grounded systems. We frame adaptation of text LLMs to audio, cross-modal alignment, and joint speech-text training; review datasets, metrics, and robustness across accents; and compare design choices (cascaded vs. E2E, post-ASR correction, streaming). We link industrial assistants to current open-domain and task-oriented agents, highlight reproducible baselines, and outline open problems in privacy, safety, and evaluation. Attendees leave with practical recipes and a clear systems-level roadmap.
Taxonomy
Topics: AI in Service Interactions · Speech and dialogue systems · Speech Recognition and Synthesis
