TVTSyn: Content-Synchronous Time-Varying Timbre for Streaming Voice Conversion and Anonymization
Waris Quamer, Mu-Ruei Tseng, Ghady Nasrallah, Ricardo Gutierrez-Osuna

TL;DR
TVTSyn introduces a content-synchronous, time-varying timbre representation for streaming voice conversion and anonymization, enabling low-latency, natural, and privacy-preserving speech synthesis with improved speaker transfer and intelligibility.
Contribution
It proposes a novel streamable speech synthesizer with a time-varying timbre representation that aligns content and identity at the frame level, enhancing privacy and naturalness.
Findings
Achieves <80 ms GPU latency in streaming synthesis.
Outperforms state-of-the-art baselines in naturalness and speaker transfer.
Effectively reduces residual speaker leakage through vector-quantized regularization.
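To make the last finding concrete, here is a minimal NumPy sketch of one common form of factorized vector quantization: content frames are projected into a small lookup space, matched to the nearest L2-normalized code, and projected back. This summary does not specify the paper's exact factorization, so the function name `factorized_vq`, the projection matrices `W_down` and `W_up`, the shapes, and the commitment term are all illustrative assumptions rather than the authors' implementation. Training would also need a straight-through gradient estimator, which is omitted here.

```python
import numpy as np

def factorized_vq(z, W_down, codebook, W_up, eps=1e-8):
    """Quantize content frames in a low-dimensional lookup space (hypothetical sketch).

    z: (T, D) content frames, W_down: (D, d), codebook: (N, d), W_up: (d, D).
    Returns quantized frames (T, D), code indices (T,), and a commitment loss.
    """
    z_low = z @ W_down  # project each frame to the small lookup space
    z_n = z_low / (np.linalg.norm(z_low, axis=-1, keepdims=True) + eps)
    c_n = codebook / (np.linalg.norm(codebook, axis=-1, keepdims=True) + eps)
    # For unit vectors, the highest cosine similarity is the nearest neighbour.
    idx = np.argmax(z_n @ c_n.T, axis=-1)
    z_q = codebook[idx] @ W_up  # map the chosen codes back to the model dimension
    commit = float(np.mean((z_low - codebook[idx]) ** 2))
    return z_q, idx, commit
```

The intuition behind such a bottleneck is that forcing each frame through a small discrete codebook limits how much fine-grained speaker detail the content stream can carry.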
Abstract
Real-time voice conversion and speaker anonymization require causal, low-latency synthesis without sacrificing intelligibility or naturalness. Current systems have a core representational mismatch: content is time-varying, while speaker identity is injected as a static global embedding. We introduce a streamable speech synthesizer that aligns the temporal granularity of identity and content via a content-synchronous, time-varying timbre (TVT) representation. A Global Timbre Memory expands a global timbre instance into multiple compact facets; frame-level content attends to this memory, a gate regulates variation, and spherical interpolation preserves identity geometry while enabling smooth local changes. In addition, a factorized vector-quantized bottleneck regularizes content to reduce residual speaker leakage. The resulting system is streamable end-to-end, with <80 ms GPU latency.…
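The timbre mechanism described in the abstract can be sketched roughly as follows. This is a NumPy illustration under stated assumptions, not the authors' architecture: the facet projections `W_facets`, the query projection `W_q`, the sigmoid gate parameters `w_gate` and `b_gate`, and all shapes are hypothetical. The sketch expands a global timbre vector into K facets, lets each content frame attend over them, and uses a per-frame gate as the spherical interpolation weight between the global timbre and the attended local timbre.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def slerp(a, b, t, eps=1e-6):
    """Spherical interpolation from a to b with weight t in [0, 1]; output is unit norm."""
    a, b = l2_normalize(a), l2_normalize(b)
    cos = np.clip((a * b).sum(-1, keepdims=True), -1.0, 1.0)
    omega = np.arccos(cos)
    sin = np.sin(omega)
    near = sin < eps  # nearly parallel vectors: fall back to linear weights
    safe_sin = np.where(near, 1.0, sin)
    w_a = np.where(near, 1.0 - t, np.sin((1.0 - t) * omega) / safe_sin)
    w_b = np.where(near, t, np.sin(t * omega) / safe_sin)
    return l2_normalize(w_a * a + w_b * b)

def time_varying_timbre(content, g, W_facets, W_q, w_gate, b_gate):
    """content: (T, Dc), g: (D,), W_facets: (K, D, D), W_q: (Dc, D), w_gate: (Dc,)."""
    # Expand the single global timbre vector into K facets (the "memory").
    memory = l2_normalize(np.einsum("kde,e->kd", W_facets, g))      # (K, D)
    # Each content frame queries the memory independently of other frames.
    queries = content @ W_q                                          # (T, D)
    attn = softmax(queries @ memory.T / np.sqrt(memory.shape[-1]))   # (T, K)
    local = attn @ memory                                            # (T, D)
    # A sigmoid gate decides how far each frame may move away from g.
    gate = 1.0 / (1.0 + np.exp(-(content @ w_gate + b_gate)))        # (T,)
    gate = gate[:, None]                                             # (T, 1)
    return slerp(g, local, gate), gate
```

Because every frame's timbre in this sketch depends only on that frame's content and the fixed global vector, the computation needs no lookahead, which is compatible with streaming. Interpolating on the unit sphere keeps every frame's timbre at unit norm, and a gate near zero recovers the static global embedding.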
Peer Reviews
Decision: ICLR 2026 Poster
- The authors identify a fundamental weakness (static speaker embedding) in current speaker anonymisation techniques and propose a well-justified fix via content-synchronous conditioning of the speaker embeddings
- Well-founded experiments on the VPC 2024 protocol and ablations confirm benefits across privacy, quality, and latency
- Deployment conditions are kept in mind by demonstrating real-time performance on CPU/GPU under tight latency budgets (<80 ms), relevant for interactive applications
- C
- The performance of the B1 baseline is mentioned during the analysis but not added to Table 2 for clear comparison
- The gating and Slerp mechanisms are intuitively motivated but not analyzed quantitatively (e.g., contribution to expressivity or privacy).
- Listening tests use a small Mechanical Turk sample (N = 20) without statistical significance analysis or demographic breakdown. A larger cohort of listeners must be recruited (>100) and carefully selected to include demographic variations (age,
1. The problem is well framed, and the paper shows good intuition about the mismatch between dynamic input and static speaker embedding. The solution is intuitive: it introduces a timbre representation that carries better temporal information.
2. The overall system is well designed and end-to-end streamable. The proposed content encoder introduces a learnable bottleneck with factorized vector quantization (VQ) that learns discrete, speaker-independent units while preserving linguistic fidelity. Also the ti
1. The discussion of the design of gating/interpolation is mostly based on intuition and empirical results. It would be good to include more theoretical analysis. The innovations in the TVT representation and the use of gating/Slerp interpolation are more at the level of improvements on top of existing architectures. Yet the experimental results show they are effective.
2. The MOS tests are based on 20 samples, which is limited and may cause bias.
This paper is well structured and provides an analysis of the efficiency of the proposed system.
- The paper's central claim, resolving the static-dynamic mismatch via time-varying timbre, is not sufficiently novel. Prior work has already explored dynamic speaker conditioning for speech synthesis.
- The dataset used in the experiments is neither convincing nor widely used, which makes me question the correctness of the conclusions.
- Baseline comparisons with the Voice Privacy Challenge baseline systems or other popular speaker anonymization systems are incomplete.
Taxonomy
Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Generative Adversarial Networks and Image Synthesis
