A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units

Li-Wei Chen; Shinji Watanabe; Alexander Rudnicky

arXiv:2211.06535·eess.AS·November 15, 2022

TL;DR

This paper introduces a unified one-shot voice conversion system that effectively models prosody and speaker attributes using self-supervised discrete speech units, improving naturalness and content preservation.

Contribution

It proposes a cascaded modular system leveraging self-supervised discrete speech units for better prosody and content modeling in voice conversion.

Findings

01 Outperforms previous methods in naturalness and intelligibility

02 Enhances speaker transferability and prosody transferability

03 Utilizes self-supervised discrete units for improved language content preservation

Abstract

We present a unified system to realize one-shot voice conversion (VC) on the pitch, rhythm, and speaker attributes. Existing works generally ignore the correlation between prosody and language content, leading to the degradation of naturalness in converted speech. Additionally, the lack of proper language features prevents these systems from accurately preserving language content after conversion. To address these issues, we devise a cascaded modular system leveraging self-supervised discrete speech units as language representation. These discrete units provide duration information essential for rhythm modeling. Our system first extracts utterance-level prosody and speaker representations from the raw waveform. Given the prosody representation, a prosody predictor estimates pitch, energy, and duration for each discrete unit in the utterance. A synthesizer further reconstructs speech…
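One plausible reading of "discrete units provide duration information" is that consecutive repeated unit IDs can be collapsed, with each run length serving as that unit's duration in frames. The sketch below illustrates only this idea. It assumes frame-level unit IDs are already available (for example from clustering self-supervised features); the function names and the example IDs are hypothetical and are not taken from the paper or its repository.

```python
from itertools import groupby


def units_to_durations(frame_units):
    """Collapse runs of identical unit IDs into (units, durations).

    frame_units: list of integer unit IDs, one per frame (assumed input).
    Returns the de-duplicated unit sequence and the run length of each unit.
    """
    units, durations = [], []
    for unit, run in groupby(frame_units):
        units.append(unit)
        durations.append(sum(1 for _ in run))
    return units, durations


def expand_units(units, durations):
    """Inverse step: repeat each unit by its duration to get frame-level IDs."""
    frames = []
    for unit, duration in zip(units, durations):
        frames.extend([unit] * duration)
    return frames


# Hypothetical frame-level unit IDs for a short utterance.
frame_units = [12, 12, 12, 7, 7, 40, 40, 40, 40, 12]
units, durations = units_to_durations(frame_units)
print(units)      # [12, 7, 40, 12]
print(durations)  # [3, 2, 4, 1]
```

Under this reading, changing the rhythm of an utterance amounts to replacing `durations` with predicted values before expanding the units again, while the unit sequence itself (the language content) stays fixed.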

Code & Models

Repositories

b04901014/uuvc (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing