StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling

Hui Wang; Yifan Yang; Shujie Liu; Jinyu Li; Lingwei Meng; Yanqing Liu; Jiaming Zhou; Haoqin Sun; Yan Lu; Yong Qin

arXiv:2506.12570·cs.SD·June 17, 2025

TL;DR

StreamMel is a single-stage streaming TTS system that interleaves text tokens with continuous mel-spectrogram frames in one autoregressive model, enabling real-time, zero-shot speech synthesis that outperforms existing streaming TTS baselines in both quality and latency.

Contribution

StreamMel is proposed as the first single-stage, interleaved autoregressive TTS framework to model continuous mel-spectrograms for real-time zero-shot speech synthesis.

Findings

01 Outperforms existing streaming TTS baselines on LibriSpeech in both quality and latency.

02 Achieves performance comparable to offline systems while generating speech in real time.

03 Supports integration with real-time speech large language models.

Abstract

Recent advances in zero-shot text-to-speech (TTS) synthesis have achieved high-quality speech generation for unseen speakers, but most systems remain unsuitable for real-time applications because of their offline design. Current streaming TTS paradigms often rely on multi-stage pipelines and discrete representations, leading to increased computational cost and suboptimal system performance. In this work, we propose StreamMel, a pioneering single-stage streaming TTS framework that models continuous mel-spectrograms. By interleaving text tokens with acoustic frames, StreamMel enables low-latency, autoregressive synthesis while preserving high speaker similarity and naturalness. Experiments on LibriSpeech demonstrate that StreamMel outperforms existing streaming TTS baselines in both quality and latency. It even achieves performance comparable to offline systems while supporting efficient…
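
The interleaving idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `interleave`, the fixed chunk sizes `n_text` and `n_mel`, and the `("text", ...)` / `("mel", ...)` tagging are assumptions made here for illustration. The sketch only shows how a single sequence could alternate small groups of text tokens with small groups of acoustic frames, so that an autoregressive model can start emitting frames before the full text has arrived.

```python
from typing import Any, List, Sequence, Tuple


def interleave(
    text_tokens: Sequence[Any],
    mel_frames: Sequence[Any],
    n_text: int = 1,
    n_mel: int = 4,
) -> List[Tuple[str, Any]]:
    """Alternate chunks of text tokens and mel frames in one sequence.

    Hypothetical helper: the chunk sizes n_text and n_mel are
    illustrative assumptions, not values taken from the paper.
    """
    if n_text < 1 or n_mel < 1:
        raise ValueError("n_text and n_mel must be positive")

    sequence: List[Tuple[str, Any]] = []
    t = 0  # index of the next unread text token
    f = 0  # index of the next unread mel frame

    # Alternate chunks while both streams still have content.
    while t < len(text_tokens) and f < len(mel_frames):
        for token in text_tokens[t:t + n_text]:
            sequence.append(("text", token))
        t += n_text
        for frame in mel_frames[f:f + n_mel]:
            sequence.append(("mel", frame))
        f += n_mel

    # Whichever stream is longer contributes its remainder at the end.
    sequence.extend(("text", token) for token in text_tokens[t:])
    sequence.extend(("mel", frame) for frame in mel_frames[f:])
    return sequence


if __name__ == "__main__":
    # Frames are stand-in integers here; real frames would be mel vectors.
    for kind, item in interleave(["a", "b"], [0, 1, 2, 3, 4], n_text=1, n_mel=2):
        print(kind, item)
```

With a layout like this, the latency before the first acoustic frame depends on `n_text` rather than on the length of the whole utterance, which is the property a streaming system needs.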

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques