End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks

Rodrigo Mira; Konstantinos Vougioukas; Pingchuan Ma; Stavros Petridis; Björn W. Schuller; Maja Pantic

arXiv:2104.13332·cs.LG·August 17, 2022

TL;DR

This paper introduces an end-to-end GAN-based model for video-to-speech synthesis that generates speech waveforms directly from raw video, with no intermediate representation or separate waveform synthesis step, and outperforms previous methods on the GRID and LRW datasets.

Contribution

The work presents the first end-to-end GAN model for video-to-speech synthesis capable of producing intelligible speech directly from raw video, including in-the-wild scenarios.

Findings

01. Outperforms previous methods on GRID and LRW datasets

02. Produces highly realistic and intelligible speech from video

03. Effective for both constrained and wild datasets

Abstract

Video-to-speech is the process of reconstructing the audio speech from a video of a spoken utterance. Previous approaches to this task have relied on a two-step process where an intermediate representation is inferred from the video, and is then decoded into waveform audio using a vocoder or a waveform reconstruction algorithm. In this work, we propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs) which translates spoken video to waveform end-to-end without using any intermediate representation or separate waveform synthesis algorithm. Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech, which is then fed to a waveform critic and a power critic. The use of an adversarial loss based on these two critics enables the direct synthesis of raw audio waveform and ensures its realism. In addition,…
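
The two-critic setup described in the abstract can be illustrated with a minimal sketch. It shows two views of one generated waveform: the raw samples, which a waveform critic would score, and a time-frequency power representation, which a power critic would score. Everything specific here is an assumption made for illustration and is not taken from the abstract: the function names, the frame length and hop, the choice of a power spectrogram as the power critic's input, and the Wasserstein-style form of the generator loss. The critics themselves are not modeled; their scores are passed in as plain numbers.

```python
import cmath
import math


def frame_signal(wave, frame_len, hop):
    """Split a waveform (list of floats) into overlapping frames."""
    return [wave[i:i + frame_len]
            for i in range(0, len(wave) - frame_len + 1, hop)]


def power_spectrum(frame):
    """Squared magnitude of a naive DFT, keeping non-negative frequencies."""
    n = len(frame)
    out = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        out.append(abs(acc) ** 2)
    return out


def power_spectrogram(wave, frame_len=8, hop=4):
    """Per-frame power spectra: the ASSUMED input of the power critic."""
    return [power_spectrum(f) for f in frame_signal(wave, frame_len, hop)]


def generator_adversarial_loss(waveform_score, power_score):
    """ASSUMED critic-style objective: the generator is rewarded when
    both critics assign high scores to the generated speech."""
    return -(waveform_score + power_score)


# Toy "generated" waveform: a sine with a period of 8 samples.
wave = [math.sin(2 * math.pi * t / 8) for t in range(16)]
spec = power_spectrogram(wave)          # view for the power critic
loss = generator_adversarial_loss(0.5, 0.25)  # placeholder critic scores
```

In a real system both critics would be trained networks and the generator would be the video encoder-decoder; the sketch only makes concrete the idea that the same synthesized audio is judged once in the time domain and once in terms of its power distribution over frequency.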
