Disentangling segmental and prosodic factors to non-native speech comprehensibility
Waris Quamer and Ricardo Gutierrez-Osuna

TL;DR
This paper introduces a novel accent conversion system that disentangles segmental and prosodic features, enabling independent manipulation to study their effects on non-native speech comprehensibility and social attitudes.
Contribution
The system uniquely separates segmental and prosodic features in non-native speech, allowing targeted manipulation and analysis of their individual impacts on comprehensibility.
Findings
Segmental features have a larger impact on comprehensibility than prosody.
Vector quantization improves prosody transfer and voice similarity.
Perceptual tests quantify contributions of speech features to intelligibility.
Abstract
Current accent conversion (AC) systems do not disentangle the two main sources of non-native accent: segmental and prosodic characteristics. Being able to manipulate a non-native speaker's segmental and/or prosodic channels independently is critical to quantify how these two channels contribute to speech comprehensibility and social attitudes. We present an AC system that not only decouples voice quality from accent, but also disentangles the latter into its segmental and prosodic characteristics. The system is able to generate accent conversions that combine (1) the segmental characteristics from a source utterance, (2) the voice characteristics from a target utterance, and (3) the prosody of a reference utterance. We show that vector quantization of acoustic embeddings and removal of consecutive duplicated codewords allow the system to transfer prosody and improve voice similarity.…
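The two operations named at the end of the abstract, vector quantization of frame-level embeddings and removal of consecutive duplicated codewords, can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how the codebook is learned or which distance is used, so the nearest-codeword rule under squared Euclidean distance, the tiny hand-written codebook, and the function names `quantize` and `collapse_repeats` are all assumptions made for illustration.

```python
from itertools import groupby


def quantize(frames, codebook):
    """Map each frame vector to the index of its nearest codeword.

    Assumption: nearest neighbour under squared Euclidean distance.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sqdist(frame, codebook[k]))
        for frame in frames
    ]


def collapse_repeats(indices):
    """Drop consecutive duplicates, keeping one index per run."""
    return [k for k, _ in groupby(indices)]


# Hypothetical 2-D codebook and frame embeddings (toy values, not real data).
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
frames = [
    [0.1, -0.1], [0.0, 0.2],   # two frames near codeword 0
    [0.9, 1.1], [1.2, 0.8],    # two frames near codeword 1
    [1.9, 0.1],                # one frame near codeword 2
    [0.1, 0.0],                # back to codeword 0
]

indices = quantize(frames, codebook)
collapsed = collapse_repeats(indices)
print(indices)    # [0, 0, 1, 1, 2, 0]
print(collapsed)  # [0, 1, 2, 0]
```

Collapsing runs discards how many frames each codeword lasted while keeping the order of codewords, so the collapsed sequence no longer encodes per-unit duration. Note that a codeword recurring later in the sequence (index 0 here) is preserved, because only adjacent repeats are merged.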
