FleSpeech: Flexibly Controllable Speech Generation with Various Prompts
Hanzhao Li, Yuke Li, Xinsheng Wang, Jingbin Hu, Qicong Xie, Shan Yang, Lei Xie

TL;DR
FleSpeech introduces a multi-stage, multimodal prompt-based speech generation framework that enhances control and flexibility in speech synthesis, accommodating various prompts and user needs.
Contribution
It presents a novel multimodal prompt encoder and a data collection pipeline, enabling more adaptable and precise speech generation compared to existing methods.
Findings
Effective multimodal prompt integration improves speech control.
Experimental results show high-quality, flexible speech synthesis.
Framework supports diverse user scenarios and creative applications.
Abstract
Controllable speech generation methods typically rely on single or fixed prompts, hindering creativity and flexibility. These limitations make it difficult to meet specific user needs in certain scenarios, such as adjusting the style while preserving a selected speaker's timbre, or choosing a style and generating a voice that matches a character's visual appearance. To overcome these challenges, we propose FleSpeech, a novel multi-stage speech generation framework that allows for more flexible manipulation of speech attributes by integrating various forms of control. FleSpeech employs a multimodal prompt encoder that processes and unifies different text, audio, and visual prompts into a cohesive representation. This approach enhances the adaptability of speech synthesis and supports creative and precise control over the generated speech. Additionally, we develop a data…
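As a rough illustration of what "unifying text, audio, and visual prompts into a cohesive representation" could look like, here is a minimal sketch. It is not the FleSpeech encoder: the abstract does not specify the architecture, so the shared dimension `DIM`, the random linear projections, the per-modality feature sizes, and the averaging step are all assumptions made only for illustration.

```python
import random

DIM = 4  # hypothetical size of the shared prompt space (assumption)


def make_projection(in_dim, out_dim, seed):
    """Random linear map standing in for a learned per-modality encoder (assumption)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix, vec):
    """Apply a linear map to a feature vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


# Hypothetical input feature sizes for each prompt modality.
PROJECTIONS = {
    "text": make_projection(3, DIM, seed=0),
    "audio": make_projection(5, DIM, seed=1),
    "visual": make_projection(2, DIM, seed=2),
}


def unify_prompts(prompts):
    """Project whichever prompts are present into the shared space and average them."""
    if not prompts:
        raise ValueError("at least one prompt is required")
    projected = [project(PROJECTIONS[name], feats) for name, feats in prompts.items()]
    return [sum(vals) / len(projected) for vals in zip(*projected)]


# A text prompt combined with a visual prompt; the audio prompt is simply omitted.
unified = unify_prompts({"text": [0.1, 0.2, 0.3], "visual": [1.0, -1.0]})
```

The point of the sketch is only that any subset of prompt types maps to a vector of the same size, so a downstream generator can be conditioned the same way regardless of which prompts the user supplied.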
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Natural Language Processing Techniques
