VoXtream2: Full-stream TTS with dynamic speaking rate control
Nikita Torgashov, Gustav Eje Henter, Gabriel Skantze

TL;DR
VoXtream2 is a real-time, controllable, zero-shot full-stream TTS model that enables dynamic speaking rate adjustments mid-utterance with high efficiency and quality.
Contribution
The paper introduces VoXtream2, a zero-shot, full-stream TTS model that supports dynamic speaking-rate control and prompt-text masking, achieving high-quality synthesis with minimal latency.
Findings
Runs 4 times faster than real time on a consumer GPU
Achieves competitive results on zero-shot benchmarks
Supports mid-utterance speaking rate adjustments
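To make the "4 times faster than real time" figure concrete, here is a minimal sketch of how a real-time factor (RTF) is usually computed. The function names and the 10-second example are illustrative assumptions, not measurements or code from the paper.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced.

    Values below 1.0 mean the system generates audio faster than it plays back.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(synthesis_seconds: float, audio_seconds: float) -> float:
    """Inverse of the RTF: how many times faster than playback the system runs."""
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return audio_seconds / synthesis_seconds


# Hypothetical numbers: 10 s of speech generated in 2.5 s of compute.
rtf = real_time_factor(2.5, 10.0)
speedup = speedup_over_real_time(2.5, 10.0)
print(f"RTF = {rtf}, speedup = {speedup}x")
```

Under this definition, "4 times faster than real time" corresponds to an RTF of 0.25.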
Abstract
Full-stream text-to-speech (TTS) for interactive systems must start speaking with minimal delay while remaining controllable as text arrives incrementally. We present VoXtream2, a zero-shot full-stream TTS model with dynamic speaking-rate control that can be updated mid-utterance on the fly. VoXtream2 combines a distribution matching mechanism over duration states with classifier-free guidance across conditioning signals to improve controllability and synthesis quality. Prompt-text masking enables textless audio prompting, removing the need for prompt transcription. Across standard zero-shot benchmarks and a dedicated speaking-rate test set, VoXtream2 achieves competitive objective and subjective results against public baselines despite a smaller model and less training data. In full-stream mode, it runs 4 times faster than real time with 74 ms first-packet latency on a consumer GPU.
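The abstract mentions classifier-free guidance across conditioning signals. The sketch below shows only the standard single-condition form of classifier-free guidance, which extrapolates from an unconditional prediction toward a conditional one. How VoXtream2 combines guidance over several conditioning signals is not specified here, so the function name, the use of logits, and the guidance scale are assumptions for illustration.

```python
from typing import List


def cfg_combine(
    cond_logits: List[float],
    uncond_logits: List[float],
    guidance_scale: float,
) -> List[float]:
    """Standard classifier-free guidance on a vector of logits.

    guided = uncond + scale * (cond - uncond)

    scale = 0 returns the unconditional prediction, scale = 1 returns the
    conditional one, and scale > 1 pushes further toward the condition.
    """
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("logit vectors must have the same length")
    return [
        u + guidance_scale * (c - u)
        for c, u in zip(cond_logits, uncond_logits)
    ]


# Hypothetical logits for two candidate tokens (illustrative values only).
cond = [2.0, 0.0]
uncond = [1.0, 1.0]
guided = cfg_combine(cond, uncond, guidance_scale=2.0)
print(guided)
```

A larger guidance scale typically trades diversity for closer adherence to the conditioning signal, which is why it is commonly exposed as a tunable parameter.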
Taxonomy
Topics: Speech Recognition and Synthesis · Multimedia Communication and Technology · Interactive and Immersive Displays
