FlexSpeech: Towards Stable, Controllable and Expressive Text-to-Speech

Linhan Ma; Dake Guo; He Wang; Jin Xu; Lei Xie

arXiv:2505.05159 · eess.AS · May 16, 2025

Open Access

TL;DR

FlexSpeech introduces a novel TTS model that combines autoregressive duration prediction with non-autoregressive acoustic modeling to achieve state-of-the-art stability, naturalness, and rapid style transfer in speech synthesis.

Contribution

The paper presents a hybrid TTS framework that explicitly models phonetic durations for stability and employs a lightweight, optimized duration predictor for style transfer, enhancing naturalness and controllability.

Findings

01 Achieves state-of-the-art stability and naturalness in zero-shot TTS.

02 Enables rapid style transfer with minimal data (~100 samples).

03 Maintains stability and quality without retraining the acoustic model.

Abstract

Current speech generation research can be categorized into two primary classes: non-autoregressive (NAR) and autoregressive (AR). The fundamental distinction between these approaches lies in the duration prediction strategy employed for predictable-length sequences. NAR methods ensure stability in speech generation by explicitly and independently modeling the duration of each phonetic unit. Conversely, AR methods predict compressed speech tokens autoregressively, modeling duration implicitly through Markov properties. Although the AR approach improves prosody, it does not provide the structural guarantees necessary for stability. To address stability and naturalness simultaneously, we propose FlexSpeech, a stable, controllable, and expressive TTS model. The motivation behind FlexSpeech is to incorporate Markov dependencies and preference…
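
The split described above, sequential duration prediction followed by frame generation from explicit durations, can be illustrated with a toy sketch. This is not the FlexSpeech implementation: the function names (`predict_durations_ar`, `length_regulate`), the vowel rule, and the `base` parameter are all hypothetical stand-ins for trained components. It only shows why explicit durations fix the output length in advance, and how a duration can depend on the one predicted before it.

```python
from typing import List, Sequence

VOWELS = set("aeiou")  # assumption: single-letter toy "phonemes"


def predict_durations_ar(phonemes: Sequence[str], base: int = 3) -> List[int]:
    """Toy autoregressive duration predictor (hypothetical rule, not a trained model).

    Each duration depends on the previous predicted duration, which mimics
    a first-order Markov dependency between neighbouring phonetic units.
    """
    durations: List[int] = []
    prev = base
    for ph in phonemes:
        # Toy target: vowels last two frames longer than consonants.
        target = base + 2 if ph.lower() in VOWELS else base
        # Smooth toward the previous duration, never below one frame.
        d = max(1, (prev + target) // 2)
        durations.append(d)
        prev = d
    return durations


def length_regulate(phonemes: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat each phoneme for its predicted number of frames.

    The output length equals sum(durations), so every phoneme is covered
    exactly once and in order: no unit can be skipped or repeated.
    """
    if len(phonemes) != len(durations):
        raise ValueError("phonemes and durations must have the same length")
    frames: List[str] = []
    for ph, d in zip(phonemes, durations):
        frames.extend([ph] * d)
    return frames


if __name__ == "__main__":
    phonemes = ["h", "e", "l", "o"]
    durations = predict_durations_ar(phonemes)
    frames = length_regulate(phonemes, durations)
    print(durations)
    print(len(frames), frames)
```

In a real system the frame-level sequence would condition a non-autoregressive acoustic model rather than being the output itself. The point of the sketch is only that the alignment is fixed before any acoustic generation starts.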

Taxonomy

Topics: Speech Recognition and Synthesis · Face recognition and analysis · Music Technology and Sound Studies