Cross-lingual Low Resource Speaker Adaptation Using Phonological Features

Georgia Maniati; Nikolaos Ellinas; Konstantinos Markopoulos; Georgios Vamvoukakis; June Sig Sung; Hyoungmin Park; Aimilios Chalamandaris; Pirros Tsiakoulis

arXiv:2111.09075·cs.SD·November 18, 2021

TL;DR

This paper presents a phonological feature-based multilingual TTS model that enables effective cross-lingual speaker adaptation with very limited data, achieving high naturalness and speaker similarity even in few-shot scenarios.

Contribution

It introduces a language-agnostic multispeaker TTS model conditioned on phonological features, enabling cross-lingual adaptation with minimal data and demonstrating few-shot learning capabilities.

Findings

01. High speaker similarity with as few as 8 utterances.

02. Model performs well in zero-shot and few-shot adaptation scenarios.

03. Effective across multiple language pairs with phonological features.

Abstract

The idea of using phonological features instead of phonemes as input to sequence-to-sequence TTS has been recently proposed for zero-shot multilingual speech synthesis. This approach is useful for code-switching, as it facilitates the seamless uttering of foreign text embedded in a stream of native text. In our work, we train a language-agnostic multispeaker model conditioned on a set of phonologically derived features common across different languages, with the goal of achieving cross-lingual speaker adaptation. We first experiment with the effect of language phonological similarity on cross-lingual TTS of several source-target language combinations. Subsequently, we fine-tune the model with very limited data of a new speaker's voice in either a seen or an unseen language, and achieve synthetic speech of equal quality, while preserving the target speaker's identity. With as few as 32…
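To make the idea of phonological-feature input concrete, here is a minimal sketch in Python of how phonemes might be encoded as binary feature vectors instead of one-hot phoneme IDs. The feature inventory, the phone table, and the helper names `encode` and `hamming` are illustrative assumptions and are not taken from the paper, whose actual feature set is not described on this page.

```python
from typing import Dict, List, Set

# Hypothetical feature inventory (an assumption for illustration only).
FEATURES: List[str] = [
    "vowel", "consonant", "voiced", "nasal", "plosive", "fricative",
    "bilabial", "alveolar", "front", "back", "rounded",
]

# Hypothetical phone-to-feature table covering a handful of IPA symbols.
PHONE_FEATURES: Dict[str, Set[str]] = {
    "p": {"consonant", "plosive", "bilabial"},
    "b": {"consonant", "voiced", "plosive", "bilabial"},
    "m": {"consonant", "voiced", "nasal", "bilabial"},
    "s": {"consonant", "fricative", "alveolar"},
    "i": {"vowel", "voiced", "front"},
    "u": {"vowel", "voiced", "back", "rounded"},
    "y": {"vowel", "voiced", "front", "rounded"},  # front rounded vowel, absent from English
}


def encode(phones: List[str]) -> List[List[int]]:
    """Map each phone to a binary vector ordered like FEATURES."""
    return [
        [1 if feature in PHONE_FEATURES[phone] else 0 for feature in FEATURES]
        for phone in phones
    ]


def hamming(a: List[int], b: List[int]) -> int:
    """Count the feature positions in which two vectors differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    vec_i, vec_u, vec_y = encode(["i", "u", "y"])
    print("distance y-i:", hamming(vec_y, vec_i))
    print("distance y-u:", hamming(vec_y, vec_u))
```

In this toy table, a phone that one language lacks (here /y/) is still built from feature dimensions that other phones share, which is the intuition behind a language-agnostic input space: an unseen phone lands close to seen ones rather than being an unknown symbol.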
