Modelling low-resource accents without accent-specific TTS frontend
Georgi Tinchev, Marta Czarnowska, Kamil Deja, Kayoko Yanagisawa, Marius Cotescu

TL;DR
This paper presents a method for modelling low-resource accents in TTS systems without an accent-specific frontend: target-accent data is augmented with voice conversion, and a multi-speaker multi-accent model is trained on the combined data. The authors report state-of-the-art results compared to other generative models.
Contribution
It introduces a novel approach combining voice conversion and multi-accent TTS training to model low-resource accents without accent-specific frontends.
Findings
Achieves state-of-the-art results in accent modelling
Effective with limited data for low-resource accents
No need for accent-specific TTS frontends
Abstract
This work focuses on modelling a speaker's accent that does not have a dedicated text-to-speech (TTS) frontend, including a grapheme-to-phoneme (G2P) module. Prior work on modelling accents assumes a phonetic transcription is available for the target accent, which might not be the case for low-resource, regional accents. In our work, we propose an approach whereby we first augment the target accent data to sound like the donor voice via voice conversion, then train a multi-speaker multi-accent TTS model on the combination of recordings and synthetic data, to generate the donor's voice speaking in the target accent. Throughout the procedure, we use a TTS frontend developed for the same language but a different accent. We show qualitative and quantitative analysis where the proposed strategy achieves state-of-the-art results compared to other generative models. Our work demonstrates that…
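The data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Utterance` record, the function names, and the `fake_convert` stand-in for a voice conversion model are all hypothetical, and the TTS training step itself is omitted. The sketch only shows how target-accent recordings might be re-voiced as the donor speaker and merged with the original recordings, with the accent label preserved.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Utterance:
    text: str
    audio: str            # placeholder for a waveform or file path
    speaker: str
    accent: str
    synthetic: bool = False


def augment_with_voice_conversion(
    target_accent_utts: List[Utterance],
    donor_speaker: str,
    convert: Callable[[str, str], str],
) -> List[Utterance]:
    """Re-voice each target-accent utterance as the donor, keeping the accent label."""
    return [
        Utterance(u.text, convert(u.audio, donor_speaker), donor_speaker, u.accent, True)
        for u in target_accent_utts
    ]


def build_training_set(
    recordings: List[Utterance],
    target_accent: str,
    donor_speaker: str,
    convert: Callable[[str, str], str],
) -> List[Utterance]:
    """Combine real recordings with voice-converted target-accent data."""
    target = [u for u in recordings if u.accent == target_accent]
    synthetic = augment_with_voice_conversion(target, donor_speaker, convert)
    return list(recordings) + synthetic


def fake_convert(audio: str, speaker: str) -> str:
    # Hypothetical stand-in: a real system would run a voice conversion model here.
    return f"vc({audio}->{speaker})"


recordings = [
    Utterance("hello", "donor_001.wav", "donor", "accent_A"),
    Utterance("hello", "spk2_001.wav", "spk2", "accent_B"),
]
training_set = build_training_set(recordings, "accent_B", "donor", fake_convert)
```

In this sketch the resulting `training_set` holds the two original recordings plus one synthetic utterance labelled with the donor speaker and the target accent. A multi-speaker multi-accent TTS model conditioned on both labels could then be asked for the donor speaker with the target accent at inference time. Text would be phonemised throughout by the existing frontend for a different accent of the same language, which the sketch does not model.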
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
