Recent Advances in Speech Language Models: A Survey
Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Yiwen Guo, Irwin King

TL;DR
This paper surveys recent developments in Speech Language Models (SpeechLMs), highlighting their architecture, training methods, capabilities, evaluation metrics, challenges, and future research directions in voice-based AI interactions.
Contribution
It provides the first comprehensive overview of SpeechLMs, detailing their architecture, training recipes, capabilities, and evaluation, filling a gap in current literature.
Findings
SpeechLMs offer end-to-end speech generation without modality conversion.
They avoid the added latency and error accumulation that affect traditional ASR + LLM + TTS pipelines.
The survey identifies key challenges and future directions in SpeechLM research.
Abstract
Large Language Models (LLMs) have recently garnered significant attention, primarily for their capabilities in text-based interactions. However, natural human interaction often relies on speech, necessitating a shift towards voice-based models. A straightforward approach to achieve this involves a pipeline of "Automatic Speech Recognition (ASR) + LLM + Text-to-Speech (TTS)", where input speech is transcribed to text, processed by an LLM, and then converted back to speech. Despite being straightforward, this method suffers from inherent limitations, such as information loss during modality conversion, significant latency due to the complex pipeline, and error accumulation across the three stages. To address these issues, Speech Language Models (SpeechLMs), end-to-end models that generate speech without converting from text, have emerged as a promising alternative. This survey paper…
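The cascaded pipeline described in the abstract can be illustrated with a toy sketch. Everything here is a hypothetical stand-in rather than anything from the paper: the `Speech` class, its `emotion` field, and the `asr`, `llm`, and `tts` stubs are assumptions chosen only to show where information is lost when speech passes through text.

```python
from dataclasses import dataclass


@dataclass
class Speech:
    """Toy stand-in for an audio signal: the words plus a paralinguistic cue."""
    words: str
    emotion: str  # hypothetical field standing in for tone of voice


def asr(speech: Speech) -> str:
    # Transcription keeps only the words; the emotion field is dropped here.
    return speech.words


def llm(prompt: str) -> str:
    # Placeholder "language model": echoes the prompt in a fixed template.
    return f"You said: {prompt}"


def tts(text: str) -> Speech:
    # Synthesis sees only text, so it cannot recover the original emotion.
    return Speech(words=text, emotion="neutral")


def cascaded_pipeline(speech: Speech) -> Speech:
    # Three sequential stages: each adds latency and can pass errors downstream.
    return tts(llm(asr(speech)))


reply = cascaded_pipeline(Speech(words="I am fine", emotion="sarcastic"))
```

In this sketch the sarcastic tone of the input never reaches the `llm` stage, which is the kind of information loss during modality conversion that the abstract attributes to the cascade. An end-to-end SpeechLM would, by contrast, operate on the speech representation directly.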
Taxonomy
Topics: Speech Recognition and Synthesis
