Facetron: A Multi-speaker Face-to-Speech Model based on Cross-modal Latent Representations
Se-Yun Um, Jihyun Kim, Jihyun Lee, and Hong-Goo Kang

TL;DR
This paper introduces Facetron, a multi-speaker face-to-speech model that converts face images into speech waveforms using cross-modal latent representations and produces high-quality speech even for unseen speakers.
Contribution
The paper presents a novel end-to-end GAN-based face-to-speech model that conditions on independently controlled linguistic and speaker features, enabling flexible multi-speaker synthesis that extends to unseen speakers.
Findings
Outperforms conventional methods in objective and subjective evaluations.
Achieves high accuracy in lip-reading-based linguistic feature extraction.
Produces natural-sounding speech with high speaker and gender similarity.
Abstract
In this paper, we propose a multi-speaker face-to-speech waveform generation model that also works for unseen speaker conditions. Using a generative adversarial network (GAN) with linguistic and speaker characteristic features as auxiliary conditions, our method directly converts face images into speech waveforms under an end-to-end training framework. The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are predicted from face images using cross-modal learning with a pre-trained acoustic model. Since these two features are uncorrelated and controlled independently, we can flexibly synthesize speech waveforms whose speaker characteristics vary depending on the input face images. We show the superiority of our proposed model over conventional methods in terms of objective and subjective evaluation results.…
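The conditioning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' architecture: the feature sizes, the helper names `build_condition` and `toy_generator`, and the random linear map standing in for the GAN generator are all assumptions made for illustration. It shows only how a per-frame linguistic feature sequence and a single speaker embedding can be combined so that swapping the speaker embedding changes the output while the linguistic content stays fixed.

```python
import random


def build_condition(linguistic_frames, speaker_embedding):
    """Append the same speaker embedding to every linguistic frame.

    Hypothetical helper: concatenation is one common way to condition a
    generator; the paper's exact mechanism is not stated in the abstract.
    """
    return [list(frame) + list(speaker_embedding) for frame in linguistic_frames]


def toy_generator(condition, samples_per_frame=4, seed=0):
    """Stand-in for a GAN generator: a fixed random linear map that turns
    each conditioning frame into `samples_per_frame` waveform samples."""
    rng = random.Random(seed)
    dim = len(condition[0])
    weights = [[rng.uniform(-1, 1) for _ in range(dim)]
               for _ in range(samples_per_frame)]
    waveform = []
    for frame in condition:
        for w in weights:
            waveform.append(sum(wi * xi for wi, xi in zip(w, frame)))
    return waveform


# Hypothetical sizes: 5 video frames, 8-dim linguistic features,
# 3-dim speaker embeddings for two different faces.
rng = random.Random(1)
linguistic = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(5)]
speaker_a = [rng.gauss(0, 1) for _ in range(3)]
speaker_b = [rng.gauss(0, 1) for _ in range(3)]

cond_a = build_condition(linguistic, speaker_a)
cond_b = build_condition(linguistic, speaker_b)
wave_a = toy_generator(cond_a)
wave_b = toy_generator(cond_b)
```

Because the two feature streams are kept separate until concatenation, `cond_a` and `cond_b` share identical linguistic dimensions and differ only in the speaker dimensions, which is the property that allows speaker characteristics to be varied independently of the spoken content.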
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Speech Recognition and Synthesis
