Universal Speech Content Factorization

Henry Li Xinyuan; Zexin Cai; Lin Zhang; Leibny Paola García-Perera; Berrak Sisman; Sanjeev Khudanpur; Nicholas Andrews; Matthew Wiesner

arXiv:2603.08977·eess.AS·March 11, 2026

Open Access

TL;DR

Universal Speech Content Factorization (USCF) is a simple, invertible linear method that suppresses speaker timbre while preserving phonetic content. It enables zero-shot voice conversion from only a few seconds of target speech and serves as a training-efficient, timbre-disentangled feature for speech synthesis.

Contribution

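USCF extends Speech Content Factorization, a closed-set voice conversion method, to an open-set setting. It learns a universal speech-to-content mapping via least-squares optimization and derives speaker-specific transformations from only a few seconds of target speech, which enables zero-shot voice conversion and yields a timbre-disentangled feature for TTS.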

Findings

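01 Embedding analysis shows that USCF removes speaker-dependent variation.

02 As a zero-shot voice conversion system, USCF achieves competitive intelligibility, naturalness, and speaker similarity compared with methods that require substantially more target-speaker data or additional neural training.

03 USCF features improve training efficiency for timbre-prompted TTS.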

Abstract

We propose Universal Speech Content Factorization (USCF), a simple and invertible linear method for extracting a low-rank speech representation in which speaker timbre is suppressed while phonetic content is preserved. USCF extends Speech Content Factorization, a closed-set voice conversion (VC) method, to an open-set setting by learning a universal speech-to-content mapping via least-squares optimization and deriving speaker-specific transformations from only a few seconds of target speech. We show through embedding analysis that USCF effectively removes speaker-dependent variation. As a zero-shot VC system, USCF achieves competitive intelligibility, naturalness, and speaker similarity compared to methods that require substantially more target-speaker data or additional neural training. Finally, we demonstrate that as a training-efficient timbre-disentangled speech feature, USCF…
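
The two least-squares steps named in the abstract (a universal speech-to-content mapping, then a speaker-specific transformation fitted from a short target sample) can be illustrated with a toy NumPy sketch. This is not the authors' implementation. It assumes synthetic features of the form `S_s @ C_s` and assumes that content targets for the training speakers are already available; the abstract does not say how they are obtained. The helper `fit_linear_map`, the dimensions, and all variable names are hypothetical.

```python
import numpy as np

def fit_linear_map(inputs, targets):
    """Return A minimizing ||A @ inputs - targets||_F (columns are frames)."""
    # np.linalg.lstsq solves inputs.T @ A.T ≈ targets.T in the least-squares sense.
    a_t, *_ = np.linalg.lstsq(inputs.T, targets.T, rcond=None)
    return a_t.T

rng = np.random.default_rng(0)
d, r, frames = 12, 4, 200  # toy feature dim, content rank, frames per speaker

# Toy training set (assumption): speaker s emits features S_s @ C_s.
train_feats, train_content = [], []
for _ in range(3):
    S = rng.standard_normal((d, r))       # speaker-specific matrix
    C = rng.standard_normal((r, frames))  # low-rank content
    train_feats.append(S @ C)
    train_content.append(C)
X = np.hstack(train_feats)
C_all = np.hstack(train_content)

# Step 1: one speech-to-content map shared by all speakers, shape (r, d).
W = fit_linear_map(X, C_all)

# Step 2: for an unseen target speaker, fit a content-to-speech map (d, r)
# from a short sample, using W to estimate that sample's content.
S_tgt = rng.standard_normal((d, r))
X_tgt = S_tgt @ rng.standard_normal((r, 50))
S_hat = fit_linear_map(W @ X_tgt, X_tgt)

# Conversion: extract content from source frames, re-render with the target map.
X_src = train_feats[0][:, :20]
X_conv = S_hat @ (W @ X_src)
```

In this toy setting both fits are exact because the synthetic data is noiseless and exactly low-rank; real speech features would leave a nonzero residual, and the quality of the content estimate for unseen speakers is what the paper evaluates.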

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Speech and Audio Processing