FaceSpeak: Expressive and High-Quality Speech Synthesis from Human Portraits of Different Styles

Tian-Hao Zhang; Jiawei Zhang; Jun Wang; Xinyuan Qian; Xu-Cheng Yin

arXiv:2501.03181 · cs.SD · April 17, 2025

TL;DR

FaceSpeak is a novel method that synthesizes expressive, high-quality speech from diverse human portraits by extracting key identity and emotional features, overcoming style and data limitations.

Contribution

It introduces FaceSpeak, a new approach for portrait-based speech synthesis that captures identity and emotion from portraits of various styles, and provides a curated dataset for multi-modal TTS research.

Findings

01 Synthesizes portrait-aligned speech with high naturalness.

02 Effectively extracts identity and emotional features from diverse images.

03 Demonstrates promising results on the new dataset.

Abstract

Humans can perceive a speaker's characteristics (e.g., identity, gender, personality, and emotion) from their appearance, and these characteristics are generally aligned with their voice style. Recently, vision-driven text-to-speech (TTS) research has been grounded in real-person faces, which restricts effective speech synthesis from being applied to the many potential usage scenarios with diverse characters and image styles. To solve this issue, we introduce a novel approach, FaceSpeak. It extracts salient identity characteristics and emotional representations from a wide variety of image styles. Meanwhile, it mitigates extraneous information (e.g., background, clothing, and hair color), resulting in synthesized speech closely aligned with a character's persona. Furthermore, to overcome the scarcity of multi-modal TTS data, we have devised an innovative dataset, namely Expressive Multi-Modal TTS,…

Taxonomy

Topics: Face recognition and analysis · Emotion and Mood Recognition · Speech Recognition and Synthesis