Unseen Speaker and Language Adaptation for Lightweight Text-To-Speech with Adapters
Alessio Falai, Ziyao Zhang, Akos Gangoly

TL;DR
This paper explores adapters in lightweight cross-lingual TTS systems for unseen speaker and language adaptation, shows that they preserve the original model's capabilities, and introduces an objective accent similarity metric.
Contribution
It demonstrates the effectiveness of adapters for unseen speaker and language adaptation in lightweight TTS and proposes a new metric for accent naturalness evaluation.
Findings
Adapters effectively learn speaker-specific and language-specific information
Adapters prevent catastrophic forgetting in TTS models
Proposed metric correlates with perceived accent naturalness
Abstract
In this paper we investigate cross-lingual Text-To-Speech (TTS) synthesis through the lens of adapters, in the context of lightweight TTS systems. In particular, we compare the tasks of unseen speaker and language adaptation with the goal of synthesising a target voice in a target language, in which the target voice has no recordings therein. Results from objective evaluations demonstrate the effectiveness of adapters in learning language-specific and speaker-specific information, allowing pre-trained models to learn unseen speaker identities or languages, while avoiding catastrophic forgetting of the original model's speaker or language information. Additionally, to measure how native the generated voices are in terms of accent, we propose and validate an objective metric inspired by mispronunciation detection techniques in second-language (L2) learners. The paper also provides…
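As a rough illustration of why adapters can add a new speaker or language without disturbing what a pre-trained model already knows, the sketch below implements a generic residual bottleneck adapter in plain Python. It is not the architecture from the paper, which this summary does not specify: the class name `BottleneckAdapter`, the dimensions, the ReLU non-linearity, and the zero-initialised up-projection are all assumptions chosen for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in vector]


class BottleneckAdapter:
    """Hypothetical residual adapter: h + up(relu(down(h))).

    Only `down` and `up` would be trained during adaptation; the
    pre-trained layers around the adapter are assumed to stay frozen.
    """

    def __init__(self, hidden_dim, bottleneck_dim, seed=0):
        rng = random.Random(seed)
        # Down-projection: small random weights (assumed initialisation).
        self.down = [
            [rng.gauss(0.0, 0.02) for _ in range(hidden_dim)]
            for _ in range(bottleneck_dim)
        ]
        # Up-projection starts at zero, so the adapter is initially an
        # identity mapping and the frozen model's output is unchanged.
        self.up = [[0.0] * bottleneck_dim for _ in range(hidden_dim)]

    def num_parameters(self):
        """Number of trainable weights added by this adapter."""
        return sum(len(row) for row in self.down) + sum(len(row) for row in self.up)

    def __call__(self, hidden):
        update = matvec(self.up, relu(matvec(self.down, hidden)))
        # Residual connection: the original hidden state passes through.
        return [h + u for h, u in zip(hidden, update)]


adapter = BottleneckAdapter(hidden_dim=4, bottleneck_dim=2)
hidden_state = [0.5, -1.0, 2.0, 0.0]
adapted_state = adapter(hidden_state)
```

Because the up-projection starts at zero, the adapter initially returns its input unchanged. In this sketch, adaptation would only update the small `down` and `up` matrices, which is one common way to limit forgetting when the base weights stay frozen.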
