VoXtream2: Full-stream TTS with dynamic speaking rate control

Nikita Torgashov; Gustav Eje Henter; Gabriel Skantze

arXiv:2603.13518·eess.AS·March 17, 2026

TL;DR

VoXtream2 is a real-time, controllable, zero-shot full-stream TTS model that enables dynamic speaking rate adjustments mid-utterance with high efficiency and quality.

Contribution

The paper introduces VoXtream2, a zero-shot full-stream TTS model that combines dynamic speaking-rate control with prompt-text masking, achieving high-quality synthesis at low latency (74 ms first-packet latency on a consumer GPU).

Findings

01 Runs 4 times faster than real time on a consumer GPU in full-stream mode

02 Achieves competitive objective and subjective results on standard zero-shot benchmarks

03 Supports mid-utterance speaking-rate adjustments
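
To make the speed claim concrete: "4 times faster than real time" corresponds to a real-time factor (RTF, processing time divided by audio duration) of 0.25, and first-packet latency is the delay before the first audio chunk arrives. The sketch below shows one way such numbers could be measured for any streaming synthesizer. It is not the authors' benchmarking code; `measure_stream`, `StreamMetrics`, and the 24 kHz sample rate in the usage note are hypothetical names and values chosen for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Sequence


@dataclass
class StreamMetrics:
    first_packet_latency_s: float  # delay until the first audio chunk arrives
    rtf: float                     # processing time / audio duration
    speedup: float                 # 1 / rtf, i.e. "N times faster than real time"


def measure_stream(
    chunks: Iterable[Sequence[float]],
    sample_rate: int,
    clock: Callable[[], float] = time.perf_counter,
) -> StreamMetrics:
    """Consume a stream of audio chunks and report latency and RTF.

    `chunks` is assumed to be a lazy iterable (e.g. a generator) so that
    iterating over it includes the time spent synthesizing each chunk.
    """
    start = clock()
    first_packet_latency_s = None
    total_samples = 0
    for chunk in chunks:
        if first_packet_latency_s is None:
            # Time from the request to the first chunk being available.
            first_packet_latency_s = clock() - start
        total_samples += len(chunk)
    elapsed_s = clock() - start

    if total_samples == 0 or first_packet_latency_s is None:
        raise ValueError("stream produced no audio")

    audio_s = total_samples / sample_rate
    rtf = elapsed_s / audio_s
    return StreamMetrics(first_packet_latency_s, rtf, 1.0 / rtf)
```

As a usage example, a stream that yields 4 seconds of 24 kHz audio in 1 second of wall-clock time gives `rtf == 0.25` and `speedup == 4.0`. The `clock` parameter exists only so the measurement can be tested with a deterministic fake clock.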

Abstract

Full-stream text-to-speech (TTS) for interactive systems must start speaking with minimal delay while remaining controllable as text arrives incrementally. We present VoXtream2, a zero-shot full-stream TTS model with dynamic speaking-rate control that can be updated mid-utterance on the fly. VoXtream2 combines a distribution matching mechanism over duration states with classifier-free guidance across conditioning signals to improve controllability and synthesis quality. Prompt-text masking enables textless audio prompting, removing the need for prompt transcription. Across standard zero-shot benchmarks and a dedicated speaking-rate test set, VoXtream2 achieves competitive objective and subjective results against public baselines despite a smaller model and less training data. In full-stream mode, it runs 4 times faster than real time with 74 ms first-packet latency on a consumer GPU.
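
The abstract mentions classifier-free guidance (CFG) across conditioning signals but does not give the formula. As a rough illustration only, the sketch below uses one common multi-condition CFG form: start from the unconditional logits and add a scaled difference for each conditioning signal. Whether VoXtream2 uses this exact form is an assumption; `cfg_combine` and the example signal names are hypothetical.

```python
from typing import List, Sequence


def cfg_combine(
    uncond: Sequence[float],
    conds: Sequence[Sequence[float]],
    scales: Sequence[float],
) -> List[float]:
    """Combine logits as: uncond + sum_i scale_i * (cond_i - uncond).

    `uncond` holds logits from a pass with all conditioning dropped;
    each entry of `conds` holds logits from a pass with one conditioning
    signal enabled (assumed examples: speaker prompt, speaking rate).
    """
    if len(conds) != len(scales):
        raise ValueError("need exactly one guidance scale per conditioning signal")

    guided = list(uncond)
    for cond_logits, scale in zip(conds, scales):
        if len(cond_logits) != len(uncond):
            raise ValueError("all logit vectors must have the same length")
        for i, (c, u) in enumerate(zip(cond_logits, uncond)):
            # Push the prediction along the direction this signal suggests.
            guided[i] += scale * (c - u)
    return guided
```

With a single signal and a scale of 1.0 this reduces to the conditional logits, and with all scales at 0.0 it returns the unconditional logits, which is a quick sanity check for any implementation of this form.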

Code & Models

Models

🤗 herimor/voxtream2 (model · 1.8k downloads · ♡ 5)

Taxonomy

Topics: Speech Recognition and Synthesis · Multimedia Communication and Technology · Interactive and Immersive Displays