Long-Form Speech Generation with Spoken Language Models
Se Jin Park, Julian Salazar, Aren Jansen, Keisuke Kinoshita, Yong Man Ro, RJ Skerry-Ryan

TL;DR
This paper introduces SpeechSSM, a speech language model family that generates coherent long-form speech (e.g., 16 minutes) in a single decoding session without text intermediates, surpassing current Transformer spoken LMs in coherence and efficiency on multi-minute generations.
Contribution
SpeechSSM is the first speech language model family to learn from and sample long-form spoken audio in a single decoding session without text intermediates, addressing the loss of coherence and the inference-time memory costs that limit prior textless spoken LMs.
Findings
SpeechSSM outperforms Transformer-based spoken LMs in coherence and efficiency.
The paper introduces LibriSpeech-Long, a benchmark for long-form speech evaluation.
The paper develops new metrics for assessing long-form speech quality.
Abstract
We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, textless spoken language models struggle to generate plausible speech past tens of seconds, due to high temporal resolution of speech tokens causing loss of coherence, architectural issues with long-sequence training or extrapolation, and memory costs at inference time. From these considerations we derive SpeechSSM, the first speech language model family to learn from and sample long-form spoken audio (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session without text intermediates. SpeechSSMs leverage recent advances in linear-time sequence modeling to greatly surpass current Transformer spoken LMs in coherence and efficiency on multi-minute generations while still matching them at the utterance…
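The memory argument in the abstract can be made concrete with a rough sketch. The code below is not the paper's architecture: the token rate, the state width, and the toy recurrence are all assumptions chosen for illustration. It only contrasts a per-token attention cache, which grows with the length of the generation, against a fixed-size recurrent state of the kind used by linear-time sequence models.

```python
# Illustrative only: the token rate and sizes are assumptions, not values from the paper.
TOKENS_PER_SECOND = 25  # assumed speech-token rate
STATE_SIZE = 16         # hypothetical recurrent state width


def num_tokens(minutes: float, rate: int = TOKENS_PER_SECOND) -> int:
    """Number of speech tokens needed to cover `minutes` of audio."""
    return int(minutes * 60 * rate)


def attention_cache_entries(n_tokens: int) -> int:
    """A Transformer decoder caches one key/value entry per past token."""
    return n_tokens


def recurrent_decode(inputs, decay: float = 0.9, state_size: int = STATE_SIZE):
    """Toy linear recurrence h_t = decay * h_{t-1} + x_t with a fixed-size state."""
    state = [0.0] * state_size
    for x in inputs:
        # The state is overwritten in place of growing, so memory stays constant.
        state = [decay * h + x for h in state]
    return state


n = num_tokens(16)
final_state = recurrent_decode([1.0] * n)
print(n, attention_cache_entries(n), len(final_state))  # 24000 24000 16
```

Under these assumed numbers, 16 minutes of speech is 24,000 tokens, so the attention cache holds 24,000 entries per layer while the recurrent state stays at 16 values regardless of how long decoding runs.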
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Natural Language Processing Techniques
