Facetron: A Multi-speaker Face-to-Speech Model based on Cross-modal Latent Representations

Se-Yun Um; Jihyun Kim; Jihyun Lee; and Hong-Goo Kang

arXiv:2107.12003·cs.CV·March 16, 2023

Open Access

TL;DR

This paper introduces Facetron, a multi-speaker face-to-speech model that converts face images directly into speech waveforms using cross-modal latent representations, and that also works for speakers unseen during training.

Contribution

The paper presents an end-to-end GAN-based face-to-speech model conditioned on linguistic and speaker features that are controlled independently, which allows flexible multi-speaker synthesis, including for unseen speakers.

Findings

01 Outperforms conventional methods in objective and subjective evaluations.

02 Achieves high accuracy in lip-reading-based linguistic feature extraction.

03 Produces natural-sounding speech with high speaker and gender similarity.

Abstract

In this paper, we propose a multi-speaker face-to-speech waveform generation model that also works for unseen speaker conditions. Using a generative adversarial network (GAN) with linguistic and speaker characteristic features as auxiliary conditions, our method directly converts face images into speech waveforms under an end-to-end training framework. The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are predicted from face images using cross-modal learning with a pre-trained acoustic model. Since these two features are uncorrelated and controlled independently, we can flexibly synthesize speech waveforms whose speaker characteristics vary depending on the input face images. We show the superiority of our proposed model over conventional methods in terms of objective and subjective evaluation results.…
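
The independent control of the two conditions described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function name `build_condition`, the feature dimensions, the frame count, and the choice of concatenation as the way to combine the features are all assumptions made for illustration, and random arrays stand in for the lip-reading and cross-modal encoder outputs.

```python
import numpy as np


def build_condition(linguistic, speaker):
    """Tile one speaker embedding over time and append it to each frame.

    linguistic: (frames, d_ling) per-frame features (stand-in for lip-reading output)
    speaker:    (d_spk,) one embedding per utterance (stand-in for face-derived output)
    Concatenation is an assumed combination scheme, not taken from the paper.
    """
    linguistic = np.asarray(linguistic, dtype=np.float32)
    speaker = np.asarray(speaker, dtype=np.float32)
    if linguistic.ndim != 2 or speaker.ndim != 1:
        raise ValueError("expected (frames, d_ling) and (d_spk,) arrays")
    frames = linguistic.shape[0]
    # Repeat the utterance-level embedding for every frame without copying.
    tiled = np.broadcast_to(speaker, (frames, speaker.shape[0]))
    return np.concatenate([linguistic, tiled], axis=1)


# Hypothetical sizes chosen only for the demo.
FRAMES, D_LING, D_SPK = 75, 256, 128
rng = np.random.default_rng(0)
linguistic = rng.standard_normal((FRAMES, D_LING))
speaker_a = rng.standard_normal(D_SPK)
speaker_b = rng.standard_normal(D_SPK)

# Same lip movements, two different faces: only the speaker channels differ.
cond_a = build_condition(linguistic, speaker_a)
cond_b = build_condition(linguistic, speaker_b)
print(cond_a.shape)
```

Swapping `speaker_a` for `speaker_b` leaves the first `D_LING` channels untouched, which is the sense in which the spoken content and the speaker identity can be varied separately in this sketch.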

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Speech Recognition and Synthesis