Spoken Conversational Agents with Large Language Models

Chao-Han Huck Yang; Andreas Stolcke; Larry Heck

arXiv:2512.02593·cs.CL·March 10, 2026

TL;DR

This paper reviews the development of spoken conversational agents powered by large language models, discussing system architectures, adaptation techniques, datasets, and open challenges in privacy and safety.

Contribution

It provides a comprehensive overview of system designs, adaptation methods, and open problems for integrating large language models into spoken conversational agents.

Findings

01 Comparison of cascaded vs. end-to-end systems

02 Analysis of robustness across accents

03 Reproducible baselines and practical recipes

Abstract

Spoken conversational agents are converging toward voice-native LLMs. This tutorial distills the path from cascaded ASR/NLU to end-to-end, retrieval- and vision-grounded systems. We frame adaptation of text LLMs to audio, cross-modal alignment, and joint speech-text training; review datasets, metrics, and robustness across accents; and compare design choices (cascaded vs. E2E, post-ASR correction, streaming). We link industrial assistants to current open-domain and task-oriented agents, highlight reproducible baselines, and outline open problems in privacy, safety, and evaluation. Attendees leave with practical recipes and a clear systems-level roadmap.

Taxonomy

Topics: AI in Service Interactions · Speech and dialogue systems · Speech Recognition and Synthesis