A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units
Li-Wei Chen, Shinji Watanabe, Alexander Rudnicky

TL;DR
This paper introduces a unified one-shot voice conversion system that effectively models prosody and speaker attributes using self-supervised discrete speech units, improving naturalness and content preservation.
Contribution
It proposes a cascaded modular system leveraging self-supervised discrete speech units for better prosody and content modeling in voice conversion.
Findings
Outperforms previous methods in naturalness and intelligibility
Improves both speaker transferability and prosody transferability
Uses self-supervised discrete units to better preserve language content
Abstract
We present a unified system to realize one-shot voice conversion (VC) on the pitch, rhythm, and speaker attributes. Existing works generally ignore the correlation between prosody and language content, leading to the degradation of naturalness in converted speech. Additionally, the lack of proper language features prevents these systems from accurately preserving language content after conversion. To address these issues, we devise a cascaded modular system leveraging self-supervised discrete speech units as language representation. These discrete units provide duration information essential for rhythm modeling. Our system first extracts utterance-level prosody and speaker representations from the raw waveform. Given the prosody representation, a prosody predictor estimates pitch, energy, and duration for each discrete unit in the utterance. A synthesizer further reconstructs speech…
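The abstract notes that the discrete units carry duration information used for rhythm modeling. One common way to obtain per-unit durations, assumed here and not confirmed by the abstract, is to collapse consecutive repeats of frame-level unit IDs into (unit, duration) pairs. The sketch below illustrates that idea in Python; the function names `collapse_units` and `expand_units` and the example unit IDs are hypothetical.

```python
from itertools import groupby


def collapse_units(frame_units):
    """Run-length encode frame-level unit IDs.

    Returns the de-duplicated unit sequence and, for each unit,
    the number of consecutive frames it spanned (its duration).
    This is an assumed preprocessing step, not the paper's confirmed method.
    """
    units, durations = [], []
    for unit, run in groupby(frame_units):
        units.append(unit)
        durations.append(sum(1 for _ in run))
    return units, durations


def expand_units(units, durations):
    """Invert collapse_units: repeat each unit for its duration in frames."""
    frames = []
    for unit, duration in zip(units, durations):
        frames.extend([unit] * duration)
    return frames


# Hypothetical frame-level unit IDs, e.g. cluster indices of self-supervised features.
frame_units = [12, 12, 12, 7, 7, 31, 31, 31, 31, 12]
units, durations = collapse_units(frame_units)
print(units)      # de-duplicated unit sequence
print(durations)  # frames per unit
```

Under this assumption, a duration predictor would be trained to output the `durations` list for a given `units` sequence, and changing those durations before calling `expand_units` would change the rhythm of the synthesized utterance.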
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
