Text-to-Speech for Unseen Speakers via Low-Complexity Discrete Unit-Based Frame Selection

Ismail Rasim Ulgen; Shreeram Suresh Chandra; Junchen Lu; Berrak Sisman

arXiv:2408.17432 · eess.AS · September 18, 2025

Open Access

TL;DR

SelectTTS is a low-complexity, frame selection-based approach to synthesizing speech for unseen speakers. It matches state-of-the-art quality with over 8x fewer parameters and 270x less training data, which makes multi-speaker TTS research more accessible in settings with limited compute and data.

Contribution

The paper presents SelectTTS, a method that selects frames from the target speaker's speech and decodes them using frame-level self-supervised learning (SSL) features, enabling low-complexity, high-quality multi-speaker TTS for unseen speakers.

Findings

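01. Achieves performance comparable to state-of-the-art multi-speaker TTS systems on both objective and subjective metrics.

02. Requires over 8x fewer parameters than state-of-the-art systems.

03. Uses 270x less training data than state-of-the-art systems.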

Abstract

Synthesizing the voices of unseen speakers remains a persisting challenge in multi-speaker text-to-speech (TTS). Existing methods model speaker characteristics through speaker conditioning during training, leading to increased model complexity and limiting reproducibility and accessibility. A low-complexity alternative would broaden the reach of speech synthesis research, particularly in settings with limited computational and data resources. To this end, we propose SelectTTS, a simple and effective alternative. SelectTTS selects appropriate frames from the target speaker and decodes them using frame-level self-supervised learning (SSL) features. We demonstrate that this approach can effectively capture speaker characteristics for unseen speakers and achieves performance comparable to state-of-the-art multi-speaker TTS frameworks on both objective and subjective metrics. By directly…
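
The abstract does not spell out the selection procedure, so the following is only a minimal sketch of what unit-based frame selection could look like, not the authors' algorithm. It assumes that each frame of the target speaker's reference speech carries a discrete unit ID and an SSL feature vector, and that a separate text-to-unit model (not shown) has already predicted a unit sequence for the input text. The function name `select_frames`, the tie-breaking rule that prefers the frame continuing the previous selection, and the fallback for missing units are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def select_frames(predicted_units, target_units, target_features):
    """Pick one target-speaker frame per predicted unit (illustrative only).

    predicted_units: unit IDs predicted from text (assumed given).
    target_units:    unit ID of each frame in the target speaker's speech.
    target_features: SSL feature vector of each target-speaker frame.
    """
    # Index the target speaker's frames by their discrete unit ID.
    frames_by_unit = defaultdict(list)
    for index, unit in enumerate(target_units):
        frames_by_unit[unit].append(index)

    selected = []
    for unit in predicted_units:
        candidates = frames_by_unit.get(unit, [])
        # Assumption: prefer the frame that would continue the previous
        # selection in time, so that contiguous runs are kept where possible.
        wanted = selected[-1] + 1 if selected else 0
        if candidates:
            choice = min(candidates, key=lambda i: abs(i - wanted))
        else:
            # Placeholder fallback when the target speech never contains
            # this unit: repeat the previous frame (or use frame 0).
            choice = selected[-1] if selected else 0
        selected.append(choice)

    # The selected SSL features would then be passed to a vocoder (not shown).
    return selected, [target_features[i] for i in selected]


# Toy data: seven target frames with made-up unit IDs and 2-D features.
target_units = [3, 3, 7, 7, 1, 3, 7]
target_features = [[0.1 * i, 1.0 - 0.1 * i] for i in range(len(target_units))]
predicted_units = [3, 7, 7, 1]

indices, features = select_frames(predicted_units, target_units, target_features)
print(indices)  # prints [0, 2, 3, 4]
```

Because every output frame is copied from the target speaker's own speech, the speaker characteristics come from the selected frames rather than from a learned speaker embedding, which is the intuition the abstract describes.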

Taxonomy

Topics: Modular Robots and Swarm Intelligence · DNA and Biological Computing