Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis
Minsu Kim, Pingchuan Ma, Honglie Chen, Stavros Petridis, Maja Pantic

TL;DR
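This paper presents a multi-modal controllable TTS system that synthesizes voices from face images, with speech characteristics adjustable through natural text descriptions. It addresses the limited audio quality of audio-visual speech corpora and supports voice generation from artistic portraits as well as diverse voice generation.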
Contribution
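It proposes three techniques: a training method that adds high-quality audio-only speech corpora to audio-visual data, stylization-based augmentation of the input face image so that artistic portraits are supported, and sampling-based decoding followed by prompting with generated speech for diverse yet consistent voice synthesis.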
Findings
Effective face-driven voice synthesis validated by experiments
Enhanced voice quality using high-quality audio corpora
Ability to generate artistic and diverse voices
Abstract
This paper explores multi-modal controllable Text-to-Speech Synthesis (TTS), in which the voice is generated from a face image and the characteristics of the output speech (e.g., pace, noise level, distance, tone, place) can be controlled with a natural text description. Specifically, we aim to mitigate the following three challenges in face-driven TTS systems. 1) To overcome the limited audio quality of audio-visual speech corpora, we propose a training method that additionally utilizes high-quality audio-only speech corpora. 2) To generate voices not only from real human faces but also from artistic portraits, we propose augmenting the input face image with stylization. 3) To account for the one-to-many possibilities in face-to-voice mapping while ensuring consistent voice generation, we propose to first employ sampling-based decoding and then use prompting with generated speech…
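The third point, sampling first and then prompting with the generated speech, can be illustrated with a toy sketch. The abstract is truncated here and does not describe the model, so everything below is an assumption made for illustration: the voice is reduced to a small vector, `sample_voice` and `synthesize` are hypothetical names, and Gaussian noise stands in for whatever sampling-based decoding the paper actually uses.

```python
import random


def sample_voice(face_embedding, rng, temperature=1.0):
    """Toy stand-in for sampling-based decoding (assumption, not the paper's model).

    Draws one plausible voice vector around the face embedding, so the same
    face can map to many different voices.
    """
    return [x + rng.gauss(0.0, temperature) for x in face_embedding]


def synthesize(text, face_embedding, rng, prompt=None):
    """Return a toy 'utterance' as a dict (hypothetical helper).

    Without a prompt, the voice is sampled from the face embedding.
    With a prompt (an earlier generated utterance), its voice is reused,
    which keeps later utterances consistent.
    """
    if prompt is not None:
        voice = list(prompt["voice"])
    else:
        voice = sample_voice(face_embedding, rng)
    return {"text": text, "voice": voice}


rng = random.Random(0)
face = [0.2, -0.1, 0.5]  # made-up face embedding

# Step 1: sample a voice for the first utterance.
first = synthesize("Hello.", face, rng)

# Step 2: prompt with the generated speech so the voice stays the same.
second = synthesize("Nice to meet you.", face, rng, prompt=first)

# A fresh sample without a prompt gives a different voice for the same face.
other = synthesize("Hello.", face, rng)
```

In this sketch, `second` shares the voice of `first`, while `other` shows the one-to-many behaviour that sampling alone would give.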
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques · Speech Recognition and Synthesis
