StyleStream: Real-Time Zero-Shot Voice Style Conversion

Yisi Liu; Nicholas Lee; Gopala Anumanchipalli

arXiv:2602.20113·cs.SD·February 24, 2026

Open Access · 3 Reviews

TL;DR

StyleStream is a novel real-time, zero-shot voice style conversion system that effectively disentangles content from style and reintroduces target style using a diffusion transformer, enabling high-quality, low-latency voice transformation.

Contribution

It introduces StyleStream, the first streamable zero-shot voice style conversion system with a non-autoregressive architecture and state-of-the-art performance.

Findings

1. Achieves real-time conversion with 1-second latency.
2. Outperforms prior methods in style transfer quality.
3. Enables zero-shot style conversion without speaker-specific training.

Abstract

Voice style conversion aims to transform an input utterance to match a target speaker's timbre, accent, and emotion, with a central challenge being the disentanglement of linguistic content from style. While prior work has explored this problem, conversion quality remains limited, and real-time voice style conversion has not been addressed. We propose StyleStream, the first streamable zero-shot voice style conversion system that achieves state-of-the-art performance. StyleStream consists of two components: a Destylizer, which removes style attributes while preserving linguistic content, and a Stylizer, a diffusion transformer (DiT) that reintroduces target style conditioned on reference speech. Robust content-style disentanglement is enforced through text supervision and a highly constrained information bottleneck. This design enables a fully non-autoregressive architecture, achieving…
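The reviews on this page describe the information bottleneck as FSQ-based (finite scalar quantization). As a rough illustration of why such a bottleneck is "highly constrained", the sketch below quantizes each latent dimension to a handful of integer levels, so the whole code space is just the product of the level counts. The function names, the level counts `[3, 5, 3]`, and the restriction to odd level counts are assumptions made for this example; the page does not give the paper's actual configuration.

```python
import math


def fsq_quantize(z, levels):
    """Quantize each latent dimension to one of levels[i] integer values.

    Assumption for this sketch: every level count is odd, so the grid is
    symmetric around zero (e.g. 5 levels -> {-2, -1, 0, 1, 2}).
    """
    if len(z) != len(levels):
        raise ValueError("z and levels must have the same length")
    codes = []
    for value, n_levels in zip(z, levels):
        if n_levels % 2 == 0:
            raise ValueError("this sketch handles odd level counts only")
        half = (n_levels - 1) // 2
        bounded = math.tanh(value) * half  # squash into (-half, half)
        codes.append(round(bounded))       # snap to the nearest integer level
    return codes


def fsq_index(codes, levels):
    """Map per-dimension codes to a single token id (mixed-radix encoding)."""
    index = 0
    for code, n_levels in zip(codes, levels):
        half = (n_levels - 1) // 2
        index = index * n_levels + (code + half)
    return index


def codebook_size(levels):
    """Number of distinct tokens the bottleneck can emit."""
    return math.prod(levels)


if __name__ == "__main__":
    levels = [3, 5, 3]            # hypothetical level counts
    latent = [0.0, 10.0, -10.0]   # hypothetical encoder output for one frame
    codes = fsq_quantize(latent, levels)
    print(codes, fsq_index(codes, levels), codebook_size(levels))
```

With so few levels per dimension, each frame can carry only a small number of bits, which is the intuition behind pairing a tight quantizer with text supervision: the tokens have room for linguistic content but little spare capacity for timbre, accent, or emotion.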

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. Real-time voice style conversion is an important and challenging problem that requires transferring not only timbre but also higher-level stylistic cues (accent and emotion). Tackling this in a streaming setting is timely and practically valuable.
2. Experimental results are strong, and the demo samples sound convincing, with clear style transfer and good intelligibility.
3. The method design is well thought out: combining ASR loss with a small quantization codebook effectively improves the d

Weaknesses

1. The main contribution lies in the task and empirical achievement (a functioning real-time voice style conversion system) rather than in methodological novelty. The ASR-supervised tokenizer and DiT-based spectrogram generator follow ideas already seen in recent works (e.g., CosyVoice2, E2-TTS, F5-TTS). Hence, the paper’s conceptual originality for the machine learning community may be somewhat limited, though it is impactful for speech research.
2. A stronger baseline would help: a simple casca

Reviewer 02 · Rating 0 · Confidence 5

Strengths

The paper uses a well-known method for speech disentanglement and conditional flow matching (CFM)-based speech generation with diffusion transformers.

Weaknesses

Overall, my primary concern lies in the lack of novelty. I could not find any clear novel contribution in this paper and it fails to include a sufficient discussion of related works. All the techniques employed are already well-known.

[ASR-based linguistic information retrieval] The main contribution claimed by the paper is the destylizer, which is trained using ASR-based text supervision and FSQ-based information bottleneck. However, the tokenizer of CosyVoice was already trained with ASR pre

Reviewer 03 · Rating 0 · Confidence 5

Strengths

This work proposes a framework for real-time zero-shot voice style conversion, integrating ASR-based content supervision, a compact FSQ bottleneck, and a diffusion transformer stylizer.

Weaknesses

- My primary concern is the lack of novelty and research contribution in the proposed methods. The paper’s main components, the destylizer, FSQ bottleneck, ASR supervision, and DiT stylizer, are all derived from prior works without introducing a truly new conceptual contribution.
- The paper fails to reference several key prior studies that have already explored DiT-based in-context learning for voice conversion and style transfer, which weakens the positioning of the proposed work within exis

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Face recognition and analysis