Long-Form Speech Generation with Spoken Language Models

Se Jin Park; Julian Salazar; Aren Jansen; Keisuke Kinoshita; Yong Man Ro; RJ Skerry-Ryan

arXiv:2412.18603 · cs.CL · July 11, 2025

TL;DR

This paper introduces SpeechSSM, a family of speech language models that generates coherent long-form speech (e.g., 16 minutes) in a single decoding session without text intermediates, surpassing current Transformer spoken LMs in coherence and efficiency on multi-minute generations.

Contribution

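SpeechSSM is the first speech language model family to learn from and sample long-form spoken audio in a single decoding session without text intermediates. It addresses the loss of coherence, long-sequence training and extrapolation issues, and inference-time memory costs that limit prior textless spoken LMs.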

Findings

01

SpeechSSM outperforms current Transformer-based spoken LMs in coherence and efficiency on multi-minute generations.

02

The paper introduces LibriSpeech-Long, a benchmark for long-form speech evaluation.

03

The paper develops new metrics for assessing long-form speech quality.

Abstract

We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, textless spoken language models struggle to generate plausible speech past tens of seconds, due to high temporal resolution of speech tokens causing loss of coherence, architectural issues with long-sequence training or extrapolation, and memory costs at inference time. From these considerations we derive SpeechSSM, the first speech language model family to learn from and sample long-form spoken audio (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session without text intermediates. SpeechSSMs leverage recent advances in linear-time sequence modeling to greatly surpass current Transformer spoken LMs in coherence and efficiency on multi-minute generations while still matching them at the utterance…

Code & Models

Repositories

google-deepmind/librispeech-long
Official

Datasets

ilyakam/librispeech-long
dataset · 628 downloads

Videos

Long-Form Speech Generation with Spoken Language Models · SlidesLive

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Natural Language Processing Techniques