FlexSpeech: Towards Stable, Controllable and Expressive Text-to-Speech
Linhan Ma, Dake Guo, He Wang, Jin Xu, Lei Xie

TL;DR
FlexSpeech introduces a novel TTS model that combines autoregressive duration prediction with non-autoregressive acoustic modeling to achieve state-of-the-art stability, naturalness, and rapid style transfer in speech synthesis.
Contribution
The paper presents a hybrid TTS framework that explicitly models phonetic durations for stability and employs a lightweight, optimized duration predictor for style transfer, enhancing naturalness and controllability.
Findings
Achieves state-of-the-art stability and naturalness in zero-shot TTS.
Enables rapid style transfer with minimal data (~100 samples).
Maintains stability and quality without retraining the acoustic model.
Abstract
Current speech generation research can be categorized into two primary classes: non-autoregressive (NAR) and autoregressive (AR). The fundamental distinction between these approaches lies in the duration prediction strategy employed for predictable-length sequences. NAR methods ensure stability in speech generation by explicitly and independently modeling the duration of each phonetic unit. Conversely, AR methods employ an autoregressive paradigm to predict the compressed speech token by implicitly modeling duration with Markov properties. Although this approach improves prosody, it does not provide the structural guarantees necessary for stability. To simultaneously address the issues of stability and naturalness in speech generation, we propose FlexSpeech, a stable, controllable, and expressive TTS model. The motivation behind FlexSpeech is to incorporate Markov dependencies and preference…
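To make the idea of explicit duration modeling concrete, the following is a minimal sketch of a length regulator, the generic mechanism NAR systems commonly use to expand one hidden state per phonetic unit into a frame-level sequence. It is not taken from the FlexSpeech paper; the function name `length_regulate`, the toy hidden states, and the example durations are all assumptions made for illustration.

```python
from typing import List, Sequence


def length_regulate(
    phoneme_states: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each phoneme-level state for its predicted number of frames.

    Hypothetical helper: illustrates explicit duration modeling in general,
    not the authors' implementation.
    """
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme state")

    frames: List[List[float]] = []
    for state, duration in zip(phoneme_states, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        # Copy the state once per frame so the output rows are independent.
        frames.extend(list(state) for _ in range(duration))
    return frames


# Toy example (assumed values): three phonemes with 2-dimensional states.
states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
predicted_durations = [2, 0, 3]  # frames per phoneme; 0 drops the unit
frames = length_regulate(states, predicted_durations)
print(len(frames))
```

Because the output length is fixed by the sum of the durations, every phonetic unit maps to a known span of frames, which is the structural property the abstract associates with stability. In a setup like the one described, only the module that produces the durations would need to change to alter the speaking style.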
Taxonomy
Topics: Speech Recognition and Synthesis · Face recognition and analysis · Music Technology and Sound Studies
