SpeakStream: Streaming Text-to-Speech with Interleaved Data

Richard He Bai; Zijin Gu; Tatiana Likhomanenko; Navdeep Jaitly

arXiv:2505.19206 · cs.CL · May 27, 2025

Open Access

TL;DR

SpeakStream is a streaming text-to-speech system that produces incremental audio from streaming text, significantly reducing latency in conversational AI applications while maintaining high speech quality.

Contribution

It introduces a decoder-only streaming TTS architecture trained on interleaved data, enabling low-latency speech synthesis suitable for real-time conversational agents.

Findings

01. Achieves state-of-the-art first-token latency
02. Maintains high speech quality comparable to non-streaming TTS
03. Effective for cascaded conversational AI systems

Abstract

The latency bottleneck of traditional text-to-speech (TTS) systems fundamentally hinders the potential of streaming large language models (LLMs) in conversational AI. These TTS systems, typically trained and run on complete utterances, introduce unacceptable delays when coupled with streaming LLM outputs, even when their inference speed is optimized. This is particularly problematic for responsive conversational agents, where low first-token latency is critical. In this paper, we present SpeakStream, a streaming TTS system that generates audio incrementally from streaming text using a decoder-only architecture. SpeakStream is trained using a next-step prediction loss on interleaved text-speech data. During inference, it generates speech incrementally while absorbing streaming input text, making it particularly suitable for cascaded conversational AI agents where an LLM streams…
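To make the interleaving idea concrete, here is a minimal Python sketch of how aligned text and speech chunks might be flattened into a single sequence for next-step prediction, and how a streaming loop could consume text chunk by chunk. Everything in it is an illustrative assumption rather than the paper's implementation: the `<text>` and `<speech>` marker tokens, the chunk granularity, the function names, and the `generate_speech` callback (a stand-in for a decoder-only model) are all hypothetical. Whether the loss covers text positions as well as speech positions is not stated in the abstract, so the sketch simply enumerates every next-token pair.

```python
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple

# Hypothetical marker tokens; the abstract does not specify the real format.
TEXT_TAG = "<text>"
SPEECH_TAG = "<speech>"


def interleave(
    text_chunks: Sequence[Sequence[str]],
    speech_chunks: Sequence[Sequence[str]],
) -> List[str]:
    """Flatten aligned (text, speech) chunk pairs into one token sequence."""
    if len(text_chunks) != len(speech_chunks):
        raise ValueError("text and speech chunks must be aligned one-to-one")
    seq: List[str] = []
    for text, speech in zip(text_chunks, speech_chunks):
        seq.append(TEXT_TAG)
        seq.extend(text)
        seq.append(SPEECH_TAG)
        seq.extend(speech)
    return seq


def next_step_pairs(seq: Sequence[str]) -> List[Tuple[List[str], str]]:
    """Enumerate (context, target) pairs for next-token prediction."""
    return [(list(seq[:i]), seq[i]) for i in range(1, len(seq))]


def stream_synthesize(
    text_stream: Iterable[Sequence[str]],
    generate_speech: Callable[[List[str]], List[str]],
) -> Iterator[List[str]]:
    """Yield speech tokens for each text chunk as soon as it arrives.

    The running context keeps all earlier text and speech, so each new
    chunk is synthesized conditioned on the full interleaved history.
    """
    context: List[str] = []
    for chunk in text_stream:
        context.append(TEXT_TAG)
        context.extend(chunk)
        context.append(SPEECH_TAG)
        speech = generate_speech(context)  # stand-in for the model
        context.extend(speech)
        yield speech
```

In this sketch the first audio can be emitted after only the first text chunk has arrived, which is the property that lowers first-token latency relative to waiting for a complete utterance.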

Taxonomy

Topics: Speech Recognition and Synthesis