Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion
Disong Wang, Songxiang Liu, Lifa Sun, Xixin Wu, Xunying Liu, Helen Meng

TL;DR
This paper introduces a novel voice conversion system that explicitly models prosody and employs deep speaker embeddings to improve conversion quality for atypical speech, such as dysarthric and L2 speech, enhancing intelligibility and speaker similarity.
Contribution
It proposes a new VC framework with explicit prosodic modeling and deep speaker embedding learning tailored for atypical speech, addressing challenges in prosody correction and speaker identity preservation.
Findings
Speaker adaptation improves speaker similarity.
The speaker encoder reduces pronunciation errors.
Converted speech shows a significant reduction in character error rate (CER) and word error rate (WER).
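For readers unfamiliar with the two metrics, the following is a minimal sketch of how CER and WER are conventionally computed from an edit distance between a reference transcript and a recognizer's output. It is not taken from the paper: the function names, the whitespace handling, and the example strings are illustrative assumptions, and the recognizer and scoring setup used by the authors are not described in this summary.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by the reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces (an assumption; conventions vary)."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# Hypothetical transcripts, for illustration only.
print(wer("the cat sat", "the cat sit"))
print(cer("the cat sat", "the cat sit"))
```

A lower score on converted speech than on the original recording indicates that a recognizer makes fewer errors on the converted audio, which is what an intelligibility gain looks like under these metrics.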
Abstract
Though significant progress has been made for the voice conversion (VC) of typical speech, VC for atypical speech, e.g., dysarthric and second-language (L2) speech, remains a challenge, since it involves correcting for atypical prosody while maintaining speaker identity. To address this issue, we propose a VC system with explicit prosodic modelling and deep speaker embedding (DSE) learning. First, a speech-encoder strives to extract robust phoneme embeddings from atypical speech. Second, a prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values. Third, a conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech, conditioned on the target DSE that is learned via speaker encoder or speaker adaptation. Extensive experiments demonstrate that speaker adaptation can achieve higher speaker…
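The three stages described in the abstract can be sketched as a simple data flow. This is a toy illustration under stated assumptions, not the authors' implementation: every function body below is a hypothetical stand-in (averaging, constant durations, concatenation) for what would be a trained neural network in the real system, and names such as `speech_encoder`, `prosody_corrector`, `conversion_model`, and `Prosody` are chosen here for readability.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class Prosody:
    durations: List[int]  # predicted frames per phoneme
    pitch: List[float]    # predicted pitch value per phoneme


def speech_encoder(frames: List[Vector], frames_per_phoneme: int = 2) -> List[Vector]:
    """Stage 1 stand-in: average fixed-size groups of frames into 'phoneme embeddings'."""
    embeddings = []
    for start in range(0, len(frames), frames_per_phoneme):
        group = frames[start:start + frames_per_phoneme]
        dim = len(group[0])
        embeddings.append([sum(f[i] for f in group) / len(group) for i in range(dim)])
    return embeddings


def prosody_corrector(phonemes: List[Vector]) -> Prosody:
    """Stage 2 stand-in: infer 'typical' duration and pitch from the embeddings alone,
    so the atypical prosody of the input is not carried over."""
    durations = [3 for _ in phonemes]                           # hypothetical constant
    pitch = [100.0 + 10.0 * sum(p) / len(p) for p in phonemes]  # hypothetical mapping
    return Prosody(durations, pitch)


def conversion_model(phonemes: List[Vector], prosody: Prosody,
                     speaker_embedding: Vector) -> List[Vector]:
    """Stage 3 stand-in: repeat each phoneme for its duration, then attach the pitch
    value and the target speaker embedding to every output frame."""
    output = []
    for emb, dur, f0 in zip(phonemes, prosody.durations, prosody.pitch):
        for _ in range(dur):
            output.append(emb + [f0] + speaker_embedding)
    return output


def convert(frames: List[Vector], speaker_embedding: Vector) -> List[Vector]:
    phonemes = speech_encoder(frames)
    prosody = prosody_corrector(phonemes)
    return conversion_model(phonemes, prosody, speaker_embedding)


# Hypothetical input: four 2-dimensional acoustic frames and a 3-dimensional target DSE.
atypical = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
target_dse = [0.5, -0.5, 0.25]
converted = convert(atypical, target_dse)
print(len(converted), len(converted[0]))
```

The point of the sketch is the separation of concerns: prosody is predicted from phoneme content rather than copied from the input, and speaker identity enters only through the embedding passed to the final stage.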
Taxonomy
Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Speech and Audio Processing
