Robust Accent Identification via Voice Conversion and Non-Timbral Embeddings
Rayane Bakari, Olivier Le Blouch, Nicolas Gengembre, Nicholas Evans

TL;DR
This paper improves accent identification by using voice conversion for data augmentation and by exploiting non-timbral embeddings, reaching a state-of-the-art F1-score of 0.66 on the GenAID benchmark and enabling accent-controlled speech synthesis.
Contribution
It proposes voice-conversion-based speaker augmentation and non-timbral embeddings to improve accent identification and transfer, addressing the scarcity of accent-labelled data and the entanglement of accent cues with speaker traits.
Findings
Achieved a new state-of-the-art F1-score of 0.66 on the GenAID benchmark, compared with the previous score of 0.55.
Voice conversion effectively preserves accent cues during data augmentation.
Non-timbral embeddings enable high-fidelity accent-controlled TTS.
Abstract
Automatic accent identification (AID) remains a challenging task due to the complex variability of accents, the entanglement of accent cues with speaker traits, and the scarcity of reliable accent-labelled data. To address these challenges, we propose a speaker augmentation strategy using voice conversion (VC), with which we generate additional training data by converting original training utterances into different speaker voices while preserving accentual cues. For this purpose, we select two recent VC systems and evaluate their capability to preserve accent. Alternatively, we also explore the use of non-timbral embeddings in AID, for their ability to convey accent information among other non-timbral cues. The effectiveness of both methods is demonstrated on the GenAID benchmark, achieving a new state-of-the-art F1-score of 0.66, compared to the previous score of 0.55. Beyond AID, we…
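The speaker augmentation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `augment_with_vc`, the `convert` callable (a stand-in for whichever VC system is used), and the parameter `n_per_utt` are all hypothetical names introduced here. The sketch only shows the bookkeeping: each training utterance is re-rendered in other speakers' voices, and the original accent label is carried over unchanged, on the assumption that the VC system preserves accent.

```python
import random
from typing import Callable, List, Sequence, Tuple

# (waveform, accent_label, speaker_id); the waveform type is left generic.
Utterance = Tuple[Sequence[float], str, str]


def augment_with_vc(
    dataset: List[Utterance],
    target_speakers: List[str],
    convert: Callable[[Sequence[float], str], Sequence[float]],
    n_per_utt: int = 2,
    seed: int = 0,
) -> List[Utterance]:
    """Return the original data plus voice-converted copies.

    `convert(waveform, target_speaker)` is a hypothetical placeholder for a
    VC system; it is assumed to change the voice while keeping accent cues.
    """
    rng = random.Random(seed)
    augmented = list(dataset)  # keep every original utterance
    for wav, accent, speaker in dataset:
        # Never "convert" an utterance to its own speaker.
        candidates = [s for s in target_speakers if s != speaker]
        k = min(n_per_utt, len(candidates))
        for target in rng.sample(candidates, k):
            # The accent label is copied as-is; only the speaker changes.
            augmented.append((convert(wav, target), accent, target))
    return augmented
```

In practice, `convert` would wrap a real VC model, and the accent-preservation assumption would need to be checked empirically, as the abstract says the authors do for the two VC systems they select.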
