Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis

Minsu Kim; Pingchuan Ma; Honglie Chen; Stavros Petridis; Maja Pantic

arXiv:2505.18972 · eess.AS · May 27, 2025

TL;DR

This paper presents a multi-modal controllable TTS system that synthesizes voices from face images with adjustable speech characteristics, overcoming data quality issues and enabling artistic and diverse voice generation.

Contribution

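It introduces three techniques for face-driven TTS: training that additionally uses high-quality audio-only speech corpora, stylization-based augmentation of input face images so that artistic portraits are supported, and sampling-based decoding followed by prompting with generated speech for diverse yet consistent voices.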

Findings

01. Effective face-driven voice synthesis validated by experiments
02. Enhanced voice quality using high-quality audio corpora
03. Ability to generate artistic and diverse voices

Abstract

This paper explores multi-modal controllable Text-to-Speech Synthesis (TTS), where the voice is generated from a face image and the characteristics of the output speech (e.g., pace, noise level, distance, tone, place) can be controlled with a natural-language text description. Specifically, we aim to mitigate the following three challenges in face-driven TTS systems. 1) To overcome the limited audio quality of audio-visual speech corpora, we propose a training method that additionally utilizes high-quality audio-only speech corpora. 2) To generate voices not only from real human faces but also from artistic portraits, we propose augmenting the input face image with stylization. 3) To account for the one-to-many nature of face-to-voice mapping while still ensuring consistent voice generation, we propose to first employ sampling-based decoding and then use prompting with generated speech…
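The third point, sampling first and then prompting with the generated speech, can be illustrated with a toy sketch. This is not the paper's implementation: `toy_logits` is a hypothetical stand-in for a face-conditioned speech-token model, the integer `face_id` stands in for a face embedding, and the vocabulary size, temperature, and token lengths are arbitrary assumptions. It shows only the control flow: stochastic decoding lets one face yield different voice candidates, and a chosen candidate is then passed back as a prompt for later utterances.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Draw one token index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def toy_logits(face_id, context, vocab_size=8):
    """Hypothetical model: pseudo-logits that depend on the face and on prior tokens."""
    seed = face_id * 1000003 + sum((i + 1) * t for i, t in enumerate(context))
    local = random.Random(seed)
    return [local.uniform(-1.0, 1.0) for _ in range(vocab_size)]


def generate(face_id, prompt, length, temperature, rng):
    """Sample `length` new tokens, conditioning on `prompt`; return only the new tokens."""
    tokens = list(prompt)
    for _ in range(length):
        logits = toy_logits(face_id, tokens)
        tokens.append(sample_token(logits, temperature, rng))
    return tokens[len(prompt):]


# Step 1: sampling-based decoding; repeated draws for one face may differ.
rng = random.Random(0)
candidates = [generate(7, [], 6, 1.0, rng) for _ in range(3)]

# Step 2: pick one candidate and reuse it as the prompt for later utterances.
voice_prompt = candidates[0]
next_utterance = generate(7, voice_prompt, 6, 1.0, rng)
```

In a real system the prompt would be generated speech (or its acoustic tokens) rather than integers from a toy function; how the paper selects and applies the prompt is not specified in the truncated abstract above.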

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques · Speech Recognition and Synthesis