FaceSpeak: Expressive and High-Quality Speech Synthesis from Human Portraits of Different Styles
Tian-Hao Zhang, Jiawei Zhang, Jun Wang, Xinyuan Qian, Xu-Cheng Yin

TL;DR
FaceSpeak synthesizes expressive, high-quality speech from human portraits in diverse image styles by extracting identity and emotional features while suppressing irrelevant visual information. It also introduces a dataset to address the scarcity of multi-modal TTS data.
Contribution
The paper introduces FaceSpeak, a new approach to portrait-based speech synthesis that captures identity and emotion from portraits in various image styles, and provides a curated dataset for multi-modal TTS research.
Findings
Synthesizes portrait-aligned speech with high naturalness.
Effectively extracts identity and emotional features from diverse images.
Demonstrates promising results on the new dataset.
Abstract
Humans can perceive a speaker's characteristics (e.g., identity, gender, personality, and emotion) from their appearance, and these characteristics are generally aligned with their voice style. Recent vision-driven text-to-speech (TTS) research has been grounded in real-person faces, which prevents effective speech synthesis from being applied to the many potential usage scenarios involving diverse characters and image styles. To solve this issue, we introduce a novel approach, FaceSpeak. It extracts salient identity characteristics and emotional representations from a wide variety of image styles. Meanwhile, it mitigates extraneous information (e.g., background, clothing, and hair color), resulting in synthesized speech closely aligned with a character's persona. Furthermore, to overcome the scarcity of multi-modal TTS data, we have devised an innovative dataset, namely Expressive Multi-Modal TTS,…
Taxonomy
Topics: Face recognition and analysis · Emotion and Mood Recognition · Speech Recognition and Synthesis
