Universal Speech Content Factorization
Henry Li Xinyuan, Zexin Cai, Lin Zhang, Leibny Paola García-Perera, Berrak Sisman, Sanjeev Khudanpur, Nicholas Andrews, Matthew Wiesner

TL;DR
Universal Speech Content Factorization (USCF) is a linear method that effectively isolates phonetic content from speaker identity in speech, enabling zero-shot voice conversion and efficient speech synthesis without extensive target data.
Contribution
USCF extends existing speech content factorization to an open-set setting using a universal mapping, enabling zero-shot voice conversion with minimal target speech and serving as a timbre-disentangled feature for TTS.
Findings
USCF effectively removes speaker-dependent variation.
USCF achieves competitive zero-shot voice conversion quality.
USCF features improve training efficiency for timbre-prompted TTS.
Abstract
We propose Universal Speech Content Factorization (USCF), a simple and invertible linear method for extracting a low-rank speech representation in which speaker timbre is suppressed while phonetic content is preserved. USCF extends Speech Content Factorization, a closed-set voice conversion (VC) method, to an open-set setting by learning a universal speech-to-content mapping via least-squares optimization and deriving speaker-specific transformations from only a few seconds of target speech. We show through embedding analysis that USCF effectively removes speaker-dependent variation. As a zero-shot VC system, USCF achieves competitive intelligibility, naturalness, and speaker similarity compared to methods that require substantially more target-speaker data or additional neural training. Finally, we demonstrate that as a training-efficient timbre-disentangled speech feature, USCF…
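The abstract names two least-squares steps (a universal speech-to-content mapping, then a speaker-specific transformation fitted from a few seconds of target speech) but does not give their formulation here. The following is only a toy sketch of what two such linear fits could look like, not the paper's method. Everything in it is an assumption made for illustration: the synthetic data model (features equal shared content times a basis `A`, plus a per-speaker offset drawn from a small subspace `B`), the dimensions `D`, `R`, `K`, and the helper names `synth`, `fit_linear`, and `add_bias`.

```python
import numpy as np

rng = np.random.default_rng(0)
D, R, K = 16, 4, 3  # feature dim, content rank, offset-subspace rank (all hypothetical)

# Toy data model (an assumption, not the paper's): feats = content @ A + speaker offset
A = rng.standard_normal((R, D))  # shared content-to-feature basis
B = rng.standard_normal((K, D))  # subspace in which speaker offsets live

def synth(n_frames):
    """Return (content, feats) for one random toy speaker."""
    content = rng.standard_normal((n_frames, R))
    offset = rng.standard_normal(K) @ B
    return content, content @ A + offset

def fit_linear(X, Y):
    """Least-squares W minimizing ||X @ W - Y||_F."""
    # Explicit rcond because the pooled toy features are rank-deficient.
    W, *_ = np.linalg.lstsq(X, Y, rcond=1e-8)
    return W

def add_bias(C):
    """Append a column of ones so a linear fit can absorb a constant offset."""
    return np.hstack([C, np.ones((len(C), 1))])

# Step 1: one "universal" feature-to-content map, fitted on pooled training speakers.
train = [synth(200) for _ in range(8)]
W = fit_linear(np.vstack([f for _, f in train]),
               np.vstack([c for c, _ in train]))

# Step 2: a target-specific content-to-feature map from only a few frames.
tgt_content, tgt_feats = synth(20)
V_tgt = fit_linear(add_bias(tgt_feats @ W), tgt_feats)

# Conversion: extract content from an unseen source speaker, render with the target map.
src_content, src_feats = synth(50)
converted = add_bias(src_feats @ W) @ V_tgt
```

In this toy setting the source offset is cancelled by `W` and replaced by the target offset captured in `V_tgt`. Real speech features will not follow such a clean additive model, so the sketch illustrates only the mechanics of the two fits.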
Taxonomy
Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Speech and Audio Processing
