kNN Retrieval for Simple and Effective Zero-Shot Multi-speaker Text-to-Speech
Karl El Hajal, Ajinkya Kulkarni, Enno Hermann, Mathew Magimai.-Doss

TL;DR
kNN-TTS is a simple zero-shot multi-speaker TTS framework built on retrieval of self-supervised learning (SSL) speech features. It achieves competitive results with minimal training data and enables voice morphing.
Contribution
The paper introduces kNN-TTS, a retrieval-based zero-shot multi-speaker TTS method that requires only single-speaker transcribed data and leverages SSL features for effective voice synthesis.
Findings
Achieves comparable performance to state-of-the-art models trained on larger datasets.
Requires transcribed speech from only a single speaker, making it suitable for low-resource languages.
Enables fine-grained voice morphing through an interpolation parameter.
Abstract
While recent zero-shot multi-speaker text-to-speech (TTS) models achieve impressive results, they typically rely on extensive transcribed speech datasets from numerous speakers and intricate training pipelines. Meanwhile, self-supervised learning (SSL) speech features have emerged as effective intermediate representations for TTS. Further, SSL features from different speakers that are linearly close share phonetic information while maintaining individual speaker identity. In this study, we introduce kNN-TTS, a simple and effective framework for zero-shot multi-speaker TTS using retrieval methods which leverage the linear relationships between SSL features. Objective and subjective evaluations show that our models, trained on transcribed speech from a single speaker only, achieve performance comparable to state-of-the-art models that are trained on significantly larger training datasets.…
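The retrieval and morphing ideas described above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes frame-level SSL features are already available as NumPy arrays of shape (frames, dims), uses cosine similarity as the distance measure, and averages the k nearest target-speaker frames. The function names `knn_retrieve` and `morph`, the default `k=4`, and the convention that `lam=1` means "fully the target voice" are all hypothetical choices made for illustration.

```python
import numpy as np


def knn_retrieve(source_feats, target_feats, k=4):
    """Replace each source frame with the mean of its k nearest target frames.

    Assumption: nearness is measured with cosine similarity; the summary
    above does not specify the metric. Zero-norm frames are not handled.
    """
    # Normalise rows so that a dot product equals cosine similarity.
    src = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    tgt = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    sims = src @ tgt.T  # shape: (n_source_frames, n_target_frames)

    # Never ask for more neighbours than there are target frames.
    k = min(k, target_feats.shape[0])
    # Indices of the k most similar target frames for every source frame.
    top_k = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    # Gather the neighbours (n_source, k, dims) and average over k.
    return target_feats[top_k].mean(axis=1)


def morph(source_feats, retrieved_feats, lam):
    """Linearly interpolate between source and retrieved features.

    Assumption: lam = 0 keeps the source features unchanged and
    lam = 1 uses the retrieved target-speaker features only.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * retrieved_feats + (1.0 - lam) * source_feats


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.normal(size=(50, 16))   # stand-in for source SSL frames
    target = rng.normal(size=(200, 16))  # stand-in for target-speaker frames
    converted = knn_retrieve(source, target, k=4)
    halfway = morph(source, converted, lam=0.5)
    print(converted.shape, halfway.shape)
```

In a full system the resulting features would be passed to a vocoder to produce audio. Intermediate values of `lam` correspond to the fine-grained morphing mentioned in the findings.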
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
