TVTSyn: Content-Synchronous Time-Varying Timbre for Streaming Voice Conversion and Anonymization

Waris Quamer; Mu-Ruei Tseng; Ghady Nasrallah; Ricardo Gutierrez-Osuna

arXiv:2602.09389·eess.AS·February 11, 2026

Open Access 3 Reviews

TL;DR

TVTSyn introduces a content-synchronous, time-varying timbre representation for streaming voice conversion and anonymization, enabling low-latency, natural, and privacy-preserving speech synthesis with improved speaker transfer and intelligibility.

Contribution

It proposes a novel streamable speech synthesizer with a time-varying timbre representation that aligns content and identity at the frame level, enhancing privacy and naturalness.

Findings

01. Achieves <80 ms GPU latency in streaming synthesis.

02. Outperforms state-of-the-art baselines in naturalness and speaker transfer.

03. Effectively reduces residual speaker leakage through vector-quantized regularization.
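
The vector-quantized regularization in the last finding can be illustrated with a minimal sketch. It shows one common form of factorized VQ used in neural audio codecs: project each frame to a low-dimensional code space, pick the nearest codebook entry by cosine similarity, and project back. Whether TVTSyn's bottleneck takes exactly this form is not stated here. The function name `factorized_vq`, the matrices `W_in` and `W_out`, and all shapes are hypothetical, and training details (straight-through gradients, commitment loss) are omitted.

```python
import numpy as np

def factorized_vq(z, W_in, codebook, W_out, eps=1e-8):
    """Hypothetical factorized VQ bottleneck (inference only).

    z:        (T, D) continuous content frames
    W_in:     (D, d) down-projection into a small code space (d << D in practice)
    codebook: (N, d) learned code vectors
    W_out:    (d, D) up-projection back to the model dimension
    Returns the quantized frames (T, D) and the chosen code indices (T,).
    """
    z_low = z @ W_in
    # L2-normalize both sides so the lookup is a cosine-similarity match.
    z_n = z_low / (np.linalg.norm(z_low, axis=-1, keepdims=True) + eps)
    c_n = codebook / (np.linalg.norm(codebook, axis=-1, keepdims=True) + eps)
    idx = np.argmax(z_n @ c_n.T, axis=-1)
    # Every frame is replaced by one of only N code vectors, so fine-grained
    # detail that does not fit the codebook is discarded.
    z_q = codebook[idx] @ W_out
    return z_q, idx
```

Each frame is processed independently, so a lookup of this kind adds no look-ahead and is compatible with causal streaming.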

Abstract

Real-time voice conversion and speaker anonymization require causal, low-latency synthesis without sacrificing intelligibility or naturalness. Current systems have a core representational mismatch: content is time-varying, while speaker identity is injected as a static global embedding. We introduce a streamable speech synthesizer that aligns the temporal granularity of identity and content via a content-synchronous, time-varying timbre (TVT) representation. A Global Timbre Memory expands a global timbre instance into multiple compact facets; frame-level content attends to this memory, a gate regulates variation, and spherical interpolation preserves identity geometry while enabling smooth local changes. In addition, a factorized vector-quantized bottleneck regularizes content to reduce residual speaker leakage. The resulting system is streamable end-to-end, with <80 ms GPU latency.…
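
The TVT pipeline described in the abstract (memory facets, frame-level attention, gating, spherical interpolation) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names, the linear facet expansion `W_facets`, the query projection `W_q`, the sigmoid gate `w_gate`, and all shapes are hypothetical choices made to keep the example self-contained.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def slerp(a, b, t, eps=1e-6):
    """Spherical interpolation from unit vector a (D,) toward rows of b (T, D).

    t has shape (T, 1); t = 0 returns a, t = 1 returns b.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    cos = np.clip(np.sum(a * b, axis=-1, keepdims=True), -1.0, 1.0)
    omega = np.arccos(cos)
    sin = np.sin(omega)
    small = sin < eps                      # nearly parallel: fall back to lerp
    safe_sin = np.where(small, 1.0, sin)
    w_a = np.where(small, 1.0 - t, np.sin((1.0 - t) * omega) / safe_sin)
    w_b = np.where(small, t, np.sin(t * omega) / safe_sin)
    return l2_normalize(w_a * a + w_b * b)

def time_varying_timbre(content, g, W_facets, W_q, w_gate):
    """content: (T, Dc) frames; g: (D,) global timbre embedding.

    W_facets: (K, D, D), W_q: (Dc, D), w_gate: (Dc,) are assumed parameters.
    """
    g = l2_normalize(g)
    # Expand the single global embedding into K facets (the "memory").
    memory = l2_normalize(np.einsum("kde,e->kd", W_facets, g))    # (K, D)
    # Each content frame attends over the facets.
    scores = (content @ W_q) @ memory.T / np.sqrt(memory.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)                      # (T, K)
    local = attn @ memory                                         # (T, D)
    # A per-frame gate in (0, 1) limits how far timbre may drift from g.
    gate = 1.0 / (1.0 + np.exp(-(content @ w_gate)))
    gate = gate[:, None]                                          # (T, 1)
    # Slerp keeps every per-frame timbre vector on the unit sphere.
    return slerp(g, local, gate), gate
```

In this sketch each output frame depends only on the current content frame and the fixed memory, so it can be computed frame by frame, which is the property a streaming system needs. A gate value near 0 reproduces the static global embedding, and larger values move the timbre toward the attended facet mixture.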

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 5

Strengths

- Authors identify a fundamental weakness (static speaker embedding) in current speaker anonymisation techniques and propose a well-justified fix via content-synchronous conditioning of the speaker embeddings
- Well-founded experiments on the VPC 2024 protocol and ablations confirm benefits across privacy, quality, and latency
- Deployment conditions are kept in mind by demonstrating real-time performance on CPU/GPU under tight latency budgets (<80 ms), relevant for interactive applications
- C

Weaknesses

- The performance of B1 baseline is mentioned during the analysis but not added to Table 2 for clear comparison
- The gating and Slerp mechanisms are intuitively motivated but not analyzed quantitatively (e.g., contribution to expressivity or privacy).
- Listening tests use a small Mechanical Turk sample (N = 20) without statistical significance analysis or demographic breakdown. A larger cohort of listeners must be recruited (>100) and carefully selected to include demographic variations (age,

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The problem is well framed and it shows good intuition on the mismatch between dynamic input and static speaker embedding. The solution is intuitive by introducing a timbre representation that contains better temporal information.
2. The overall system is well designed and end-to-end streamable. The proposed content encoder introduces a learnable bottleneck with factorized vector quantization (VQ) that learns discrete, speaker-independent units while preserving linguistic fidelity. Also the ti

Weaknesses

1. The discussion of the gating/interpolation design is mostly based on intuition and empirical results. It would be good to include more theoretical analysis. The innovations in the TVT representation and the use of gating/Slerp interpolation are more at the level of improvements on top of existing architectures. Yet they are shown to be effective by the experimental results.
2. The MOS tests are based on 20 samples, which is limited and may cause bias

Reviewer 03 · Rating 2 · Confidence 5

Strengths

This paper is well-structured and provides analysis of the efficiency of the proposed system.

Weaknesses

- The paper’s central claim, resolving the static-dynamic mismatch via time-varying timbre, is not sufficiently novel. Prior work has already explored dynamic speaker conditioning for speech synthesis.
- The dataset employed in the experiments is neither convincing nor widely used, which makes me doubt the correctness of the conclusions.
- Incomplete baseline comparisons with Voice Privacy Challenge baseline systems or other popular speaker anonymization systems.

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Generative Adversarial Networks and Image Synthesis