Recent Advances in Speech Language Models: A Survey

Wenqian Cui; Dianzhi Yu; Xiaoqi Jiao; Ziqiao Meng; Guangyan Zhang; Qichao Wang; Yiwen Guo; Irwin King

arXiv:2410.03751 · cs.CL · August 8, 2025 · 5 cites

TL;DR

This paper surveys recent developments in Speech Language Models (SpeechLMs), highlighting their architecture, training methods, capabilities, evaluation metrics, challenges, and future research directions in voice-based AI interactions.

Contribution

It provides the first comprehensive overview of SpeechLMs, detailing their architecture, training recipes, capabilities, and evaluation, filling a gap in current literature.

Findings

01. SpeechLMs offer end-to-end speech generation without modality conversion.

02. They address the latency and error accumulation that affect traditional ASR + LLM + TTS pipelines.

03. The survey identifies key challenges and future directions in SpeechLM research.

Abstract

Large Language Models (LLMs) have recently garnered significant attention, primarily for their capabilities in text-based interactions. However, natural human interaction often relies on speech, necessitating a shift towards voice-based models. A straightforward approach to achieve this involves a pipeline of "Automatic Speech Recognition (ASR) + LLM + Text-to-Speech (TTS)", where input speech is transcribed to text, processed by an LLM, and then converted back to speech. Despite being straightforward, this method suffers from inherent limitations, such as information loss during modality conversion, significant latency due to the complex pipeline, and error accumulation across the three stages. To address these issues, Speech Language Models (SpeechLMs), end-to-end models that generate speech without converting from text, have emerged as a promising alternative. This survey paper…

Taxonomy

Topics: Speech Recognition and Synthesis