A Multi-Stage Framework for Multimodal Controllable Speech Synthesis
Rui Niu, Weihao Wu, Jie Chen, Long Ma, Zhiyong Wu

TL;DR
This paper introduces a three-stage multimodal controllable speech synthesis framework that improves robustness, diversity, and quality by integrating face and text modalities with advanced training strategies.
Contribution
It presents a novel multi-stage framework utilizing supervised learning and knowledge distillation to enhance multimodal speech synthesis performance.
Findings
Outperforms single-modal baselines in speech quality.
Enhances diversity through combined text-face and text-speech training.
Improves robustness and generalization in face-based synthesis.
Abstract
Controllable speech synthesis aims to control the style of generated speech using reference input, which can be of various modalities. Existing face-based methods struggle with robustness and generalization due to data quality constraints, while text prompt methods offer limited diversity and fine-grained control. Although multimodal approaches aim to integrate various modalities, their reliance on fully matched training data significantly constrains their performance and applicability. This paper proposes a 3-stage multimodal controllable speech synthesis framework to address these challenges. For the face encoder, we use supervised learning and knowledge distillation to tackle generalization issues. Furthermore, the text encoder is trained on both text-face and text-speech data to enhance the diversity of the generated speech. Experimental results demonstrate that this method outperforms…
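The abstract says the face encoder is trained with supervised learning plus knowledge distillation, but it does not state the teacher model, the loss, or how the two terms are weighted. The sketch below is therefore only a generic illustration of how such an objective is commonly combined, not the authors' method. It assumes a teacher style embedding (for example, from a speech-based style encoder), a cosine-distance distillation term, and a hypothetical weight `alpha`. All function names are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def distillation_loss(student_emb, teacher_emb):
    """Assumed distillation term: 1 - cosine similarity.

    It is 0 when the student (face) embedding points in the same
    direction as the teacher embedding, and 1 when they are orthogonal.
    """
    return 1.0 - cosine_similarity(student_emb, teacher_emb)


def combined_loss(student_emb, teacher_emb, supervised_loss, alpha=0.5):
    """Weighted sum of a supervised loss and the distillation term.

    `alpha` is a hypothetical mixing weight; the source does not give one.
    """
    distill = distillation_loss(student_emb, teacher_emb)
    return alpha * supervised_loss + (1.0 - alpha) * distill
```

In this toy form, `supervised_loss` is passed in as a precomputed number. In a real training loop both terms would be computed per batch inside an autodiff framework.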
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Phonetics and Phonology Research
