kNN Retrieval for Simple and Effective Zero-Shot Multi-speaker Text-to-Speech

Karl El Hajal; Ajinkya Kulkarni; Enno Hermann; Mathew Magimai.-Doss

arXiv:2408.10771 · eess.AS · February 4, 2025

TL;DR

kNN-TTS is a simple zero-shot multi-speaker TTS framework that uses SSL feature retrieval, achieving competitive results with minimal training data and enabling voice morphing.

Contribution

The paper introduces kNN-TTS, a retrieval-based zero-shot multi-speaker TTS method that requires only single-speaker transcribed data and leverages SSL features for effective voice synthesis.

Findings

01. Achieves comparable performance to state-of-the-art models trained on larger datasets.

02. Requires minimal training data, suitable for low-resource languages.

03. Enables fine-grained voice morphing through an interpolation parameter.
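
The morphing finding can be illustrated with a minimal sketch. This summary only says that an interpolation parameter controls the morphing, so the linear blend below, the function name `morph_features`, and the parameter name `lam` are assumptions made for illustration, not the authors' confirmed formulation. It assumes two frame-aligned feature arrays of identical shape, one carrying the source voice and one carrying the target voice.

```python
import numpy as np


def morph_features(source_feats, target_feats, lam):
    """Blend two frame-aligned feature arrays of shape (frames, dim).

    Hypothetical helper: lam = 0.0 returns the source features unchanged,
    lam = 1.0 returns the target features, and values in between mix them.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    source = np.asarray(source_feats, dtype=np.float64)
    target = np.asarray(target_feats, dtype=np.float64)
    if source.shape != target.shape:
        raise ValueError("source and target features must have the same shape")
    # Assumed linear interpolation between the two voices.
    return (1.0 - lam) * source + lam * target
```

Under this assumption, sweeping `lam` from 0 to 1 would move the output features gradually from one voice toward the other before they are passed to a vocoder.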

Abstract

While recent zero-shot multi-speaker text-to-speech (TTS) models achieve impressive results, they typically rely on extensive transcribed speech datasets from numerous speakers and intricate training pipelines. Meanwhile, self-supervised learning (SSL) speech features have emerged as effective intermediate representations for TTS. Further, SSL features from different speakers that are linearly close share phonetic information while maintaining individual speaker identity. In this study, we introduce kNN-TTS, a simple and effective framework for zero-shot multi-speaker TTS using retrieval methods which leverage the linear relationships between SSL features. Objective and subjective evaluations show that our models, trained on transcribed speech from a single speaker only, achieve performance comparable to state-of-the-art models that are trained on significantly larger training datasets.…
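
The retrieval idea in the abstract can be sketched roughly as follows. The abstract does not specify the distance metric, the value of k, or how neighbours are combined, so cosine similarity, `k=4`, and averaging the retrieved frames are assumptions here, and `knn_retrieve` is a hypothetical name. The sketch takes SSL feature frames as plain NumPy arrays and replaces each source frame with the mean of its nearest frames from a target speaker's feature pool.

```python
import numpy as np


def knn_retrieve(source_feats, target_pool, k=4):
    """Replace each source frame with the mean of its k nearest target frames.

    source_feats: (n_src, dim) features predicted for the input text.
    target_pool:  (n_pool, dim) features extracted from target-speaker audio.
    Assumes cosine similarity as the closeness measure.
    """
    source = np.asarray(source_feats, dtype=np.float64)
    pool = np.asarray(target_pool, dtype=np.float64)
    k = min(k, pool.shape[0])  # cannot retrieve more frames than the pool holds

    # Normalise rows so that a dot product equals cosine similarity.
    src_n = source / np.maximum(np.linalg.norm(source, axis=1, keepdims=True), 1e-12)
    pool_n = pool / np.maximum(np.linalg.norm(pool, axis=1, keepdims=True), 1e-12)
    sim = src_n @ pool_n.T  # shape (n_src, n_pool)

    # Indices of the k most similar pool frames for every source frame.
    nn_idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]

    # Average the retrieved target frames, giving an (n_src, dim) array.
    return pool[nn_idx].mean(axis=1)
```

Because the output is built entirely from target-speaker frames, it would carry the target's voice characteristics while following the phonetic content of the source frames, which is the property the abstract attributes to linearly close SSL features.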

Code & Models

Models

Hugging Face: Idiap/kNN-TTS (model · 7 downloads · 4 likes)

Videos

kNN Retrieval for Simple and Effective Zero-Shot Multi-speaker Text-to-Speech (Underline)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling