Unseen Speaker and Language Adaptation for Lightweight Text-To-Speech with Adapters

Alessio Falai; Ziyao Zhang; Akos Gangoly

arXiv:2508.18006·eess.AS·August 26, 2025

TL;DR

This paper uses adapters in lightweight cross-lingual TTS systems to adapt to unseen speakers and unseen languages while preserving the original model's capabilities, and it introduces an objective metric for how native the synthesised accent sounds.

Contribution

The paper shows that adapters are effective for unseen speaker and language adaptation in lightweight TTS, and it proposes an objective metric for evaluating accent naturalness.

Findings

01. Adapters effectively learn speaker-specific and language-specific information

02. Adapters prevent catastrophic forgetting in TTS models

03. The proposed metric correlates with perceived accent naturalness

Abstract

In this paper we investigate cross-lingual Text-To-Speech (TTS) synthesis through the lens of adapters, in the context of lightweight TTS systems. In particular, we compare the tasks of unseen speaker and language adaptation with the goal of synthesising a target voice in a target language, in which the target voice has no recordings therein. Results from objective evaluations demonstrate the effectiveness of adapters in learning language-specific and speaker-specific information, allowing pre-trained models to learn unseen speaker identities or languages, while avoiding catastrophic forgetting of the original model's speaker or language information. Additionally, to measure how native the generated voices are in terms of accent, we propose and validate an objective metric inspired by mispronunciation detection techniques in second-language (L2) learners. The paper also provides…
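To make the adapter idea in the abstract concrete, here is a minimal sketch of one common formulation: a residual bottleneck adapter placed after a frozen pre-trained layer. This is an illustration under stated assumptions, not the paper's architecture: the listing above does not describe the adapter design, and the names `BottleneckAdapter` and `frozen_layer`, the layer sizes, and the zero initialisation are all hypothetical choices.

```python
import random


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def relu(vector):
    return [max(0.0, v) for v in vector]


class BottleneckAdapter:
    """Generic residual adapter: h + W_up * relu(W_down * h).

    Assumption: this is a textbook bottleneck adapter, not necessarily
    the design used in the paper.
    """

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection gets small random weights.
        self.w_down = [
            [rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(bottleneck)
        ]
        # Up-projection starts at zero, so the adapter is initially an
        # identity mapping and the base model's output is unchanged.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, hidden):
        delta = matvec(self.w_up, relu(matvec(self.w_down, hidden)))
        return [h + d for h, d in zip(hidden, delta)]


def frozen_layer(x):
    """Hypothetical stand-in for a pre-trained layer whose weights stay fixed."""
    return [2.0 * v + 1.0 for v in x]


adapter = BottleneckAdapter(dim=4, bottleneck=2)
x = [0.5, -1.0, 0.25, 2.0]
base_out = frozen_layer(x)
adapted_out = adapter(base_out)  # equals base_out until the adapter is trained
```

In a setup like this, only the adapter weights would be updated when adapting to a new speaker or language. Because the base layer is never modified, dropping the adapter recovers the original behaviour, which is one intuition for why adapters can help avoid catastrophic forgetting.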
