StyleStream: Real-Time Zero-Shot Voice Style Conversion
Yisi Liu, Nicholas Lee, Gopala Anumanchipalli

TL;DR
StyleStream is a real-time, zero-shot voice style conversion system. A Destylizer strips style from the input while preserving linguistic content, and a diffusion transformer (DiT) Stylizer reintroduces the target style from reference speech, giving high-quality conversion at low latency.
Contribution
Introduces StyleStream, which the authors describe as the first streamable zero-shot voice style conversion system, built on a fully non-autoregressive architecture with reported state-of-the-art performance.
Findings
Achieves real-time conversion with 1-second latency.
Outperforms prior methods in style transfer quality.
Enables zero-shot style conversion without speaker-specific training.
Abstract
Voice style conversion aims to transform an input utterance to match a target speaker's timbre, accent, and emotion, with a central challenge being the disentanglement of linguistic content from style. While prior work has explored this problem, conversion quality remains limited, and real-time voice style conversion has not been addressed. We propose StyleStream, the first streamable zero-shot voice style conversion system that achieves state-of-the-art performance. StyleStream consists of two components: a Destylizer, which removes style attributes while preserving linguistic content, and a Stylizer, a diffusion transformer (DiT) that reintroduces target style conditioned on reference speech. Robust content-style disentanglement is enforced through text supervision and a highly constrained information bottleneck. This design enables a fully non-autoregressive architecture, achieving…
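The reviews on this page identify the "highly constrained information bottleneck" as finite scalar quantization (FSQ) with a small codebook. As a rough illustration only, here is a minimal NumPy sketch of FSQ-style quantization. The function names, the level counts (5, 5, 3), and the restriction to odd level counts are assumptions made for this example and are not taken from the paper; training such a bottleneck would typically also need a straight-through gradient estimator, which is omitted here.

```python
import numpy as np


def fsq_quantize(z, levels):
    """Snap each channel of z to a small set of integer levels.

    z:      array of shape (frames, channels) with unbounded real values
    levels: odd number of levels per channel (illustrative values only,
            not the configuration used in the paper)
    """
    z = np.asarray(z, dtype=float)
    levels = np.asarray(levels)
    if z.shape[-1] != levels.shape[0]:
        raise ValueError("z needs one channel per entry in levels")
    if np.any(levels % 2 == 0):
        raise ValueError("this sketch only handles odd level counts")
    half = (levels - 1) / 2
    bounded = np.tanh(z) * half           # squash each channel into [-half, half]
    return np.round(bounded).astype(int)  # nearest integer level per channel


def fsq_indices(codes, levels):
    """Map per-channel integer codes to a single token id (mixed radix)."""
    levels = np.asarray(levels)
    shifted = codes + (levels - 1) // 2   # move each channel to 0 .. L-1
    bases = np.concatenate(([1], np.cumprod(levels[:-1])))
    return (shifted * bases).sum(axis=-1)


if __name__ == "__main__":
    levels = [5, 5, 3]                    # implied codebook size: 5 * 5 * 3 = 75
    z = np.array([[0.0, 0.0, 0.0], [10.0, -10.0, 10.0]])
    codes = fsq_quantize(z, levels)
    print(codes)
    print(fsq_indices(codes, levels))
```

The implied codebook size is the product of the per-channel level counts, so shrinking either the number of channels or the levels per channel tightens the bottleneck and leaves less room for style information to pass through alongside the content.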
Peer Reviews
Decision: Submitted to ICLR 2026
1. Real-time voice style conversion is an important and challenging problem that requires transferring not only timbre but also higher-level stylistic cues (accent and emotion). Tackling this in a streaming setting is timely and practically valuable.
2. Experimental results are strong, and the demo samples sound convincing, with clear style transfer and good intelligibility.
3. The method design is well thought out: combining ASR loss with a small quantization codebook effectively improves the d
1. The main contribution lies in the task and empirical achievement (a functioning real-time voice style conversion system) rather than in methodological novelty. The ASR-supervised tokenizer and DiT-based spectrogram generator follow ideas already seen in recent works (e.g., CosyVoice2, E2-TTS, F5-TTS). Hence, the paper’s conceptual originality for the machine learning community may be somewhat limited, though it is impactful for speech research.
2. A stronger baseline would help: a simple casca
The paper utilizes a well-known method for speech disentanglement and a conditional flow matching (CFM)-based speech generation using diffusion transformers.
Overall, my primary concern lies in the lack of novelty. I could not find any clear novel contribution in this paper, and it fails to include a sufficient discussion of related work. All the techniques employed are already well-known.
[ASR-based linguistic information retrieval] The main contribution claimed by the paper is the destylizer, which is trained using ASR-based text supervision and an FSQ-based information bottleneck. However, the tokenizer of CosyVoice was already trained with ASR pre
This work proposes a framework for real-time zero-shot voice style conversion, integrating ASR-based content supervision, a compact FSQ bottleneck, and a diffusion transformer stylizer.
- My primary concern is the lack of novelty and research contribution in the proposed methods. The paper’s main components, the destylizer, FSQ bottleneck, ASR supervision, and DiT stylizer, are all derived from prior works without introducing a truly new conceptual contribution.
- The paper fails to reference several key prior studies that have already explored DiT-based in-context learning for voice conversion and style transfer, which weakens the positioning of the proposed work within exis
Taxonomy
Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Face recognition and analysis
