End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks
Rodrigo Mira, Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Björn W. Schuller, Maja Pantic

TL;DR
This paper introduces an end-to-end GAN-based model for video-to-speech synthesis that directly generates realistic speech waveforms from video input without intermediate steps, outperforming previous methods on multiple datasets.
Contribution
The work presents the first end-to-end GAN model for video-to-speech synthesis capable of producing intelligible speech directly from raw video, including in-the-wild scenarios.
Findings
Outperforms previous methods on GRID and LRW datasets
Produces highly realistic and intelligible speech from video
Effective on both constrained (GRID) and in-the-wild (LRW) datasets
Abstract
Video-to-speech is the process of reconstructing the audio speech from a video of a spoken utterance. Previous approaches to this task have relied on a two-step process where an intermediate representation is inferred from the video, and is then decoded into waveform audio using a vocoder or a waveform reconstruction algorithm. In this work, we propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs) which translates spoken video to waveform end-to-end without using any intermediate representation or separate waveform synthesis algorithm. Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech, which is then fed to a waveform critic and a power critic. The use of an adversarial loss based on these two critics enables the direct synthesis of raw audio waveform and ensures its realism. In addition,…
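The two-critic setup described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the loss formulation, so the Wasserstein-style objective, the function names (`waveform_critic`, `power_critic`, `generator_adversarial_loss`, `critic_loss`), and the frame-energy stand-in for a power representation are all assumptions made for illustration. In a real system both critics would be learned neural networks.

```python
def frame_power(waveform, frame=4):
    """Toy power view of the audio: mean squared amplitude per frame (assumption)."""
    return [
        sum(x * x for x in waveform[i:i + frame]) / frame
        for i in range(0, len(waveform) - frame + 1, frame)
    ]


def waveform_critic(waveform):
    """Hypothetical critic scoring the raw samples (fixed linear score)."""
    return sum(0.1 * x for x in waveform) / len(waveform)


def power_critic(waveform):
    """Hypothetical critic scoring the power view rather than the raw samples."""
    power = frame_power(waveform)
    return -sum(power) / len(power)


def generator_adversarial_loss(fake_waveform):
    """Assumed Wasserstein-style generator loss: raise the score of both critics."""
    return -(waveform_critic(fake_waveform) + power_critic(fake_waveform))


def critic_loss(critic, real_waveform, fake_waveform):
    """Assumed Wasserstein-style critic loss: score real audio above generated audio."""
    return critic(fake_waveform) - critic(real_waveform)


# Stand-ins for a generated waveform and a ground-truth waveform.
fake = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
real = [1.0] * 8

g_loss = generator_adversarial_loss(fake)
d_loss = critic_loss(waveform_critic, real, fake)
```

The point of the sketch is only the wiring: one generated waveform is scored twice, once as raw samples and once through a power-based view, and the generator is trained against the sum of both scores.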
