Text-to-Speech for Unseen Speakers via Low-Complexity Discrete Unit-Based Frame Selection
Ismail Rasim Ulgen, Shreeram Suresh Chandra, Junchen Lu, Berrak Sisman

TL;DR
SelectTTS introduces a low-complexity, frame selection-based approach for synthesizing speech of unseen speakers, matching state-of-the-art quality with significantly fewer parameters and training data, thus enhancing accessibility.
Contribution
The paper presents SelectTTS, a novel method that uses frame selection with SSL features for low-complexity, high-quality multi-speaker TTS of unseen speakers.
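The frame-selection idea can be illustrated with a minimal sketch. This is not the authors' algorithm, which this summary does not describe; it is a plain nearest-neighbor selection, assuming frame-level features are already available as lists of floats. The names `cosine` and `select_frames` and the toy 2-dimensional vectors are hypothetical. In a real system the features would come from an SSL model and the selected frames would be passed to a vocoder, neither of which is shown here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def select_frames(predicted, reference):
    """For each predicted frame, return the index of the most similar
    frame in the target speaker's reference speech (hypothetical helper)."""
    if not reference:
        raise ValueError("reference must contain at least one frame")
    indices = []
    for frame in predicted:
        best = max(range(len(reference)),
                   key=lambda i: cosine(frame, reference[i]))
        indices.append(best)
    return indices


# Toy stand-ins for SSL features (assumption: real features are
# high-dimensional, one vector per short frame of audio).
reference_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
predicted_frames = [[0.9, 0.1], [0.1, 0.8], [0.7, 0.7]]

selected = select_frames(predicted_frames, reference_frames)
# The output sequence is built only from the target speaker's own frames.
selected_features = [reference_frames[i] for i in selected]
```

Because every output frame is copied from the reference speech, speaker characteristics come from the selected frames themselves rather than from a learned speaker embedding, which is the intuition behind the low-complexity claim.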
Findings
Achieves performance comparable to state-of-the-art multi-speaker TTS systems on both objective and subjective metrics.
Requires over 8x fewer parameters.
Uses 270x less training data.
Abstract
Synthesizing the voices of unseen speakers remains a persisting challenge in multi-speaker text-to-speech (TTS). Existing methods model speaker characteristics through speaker conditioning during training, leading to increased model complexity and limiting reproducibility and accessibility. A low-complexity alternative would broaden the reach of speech synthesis research, particularly in settings with limited computational and data resources. To this end, we propose SelectTTS, a simple and effective alternative. SelectTTS selects appropriate frames from the target speaker and decodes them using frame-level self-supervised learning (SSL) features. We demonstrate that this approach can effectively capture speaker characteristics for unseen speakers and achieves performance comparable to state-of-the-art multi-speaker TTS frameworks on both objective and subjective metrics. By directly…
Taxonomy
Topics: Modular Robots and Swarm Intelligence · DNA and Biological Computing
