SVTS: Scalable Video-to-Speech Synthesis
Rodrigo Mira, Alexandros Haliassos, Stavros Petridis, Björn W. Schuller, Maja Pantic

TL;DR
This paper presents a scalable video-to-speech synthesis framework that effectively converts silent lip movements into speech, achieving state-of-the-art results on multiple datasets and demonstrating scalability to large, unconstrained data.
Contribution
Introduces a scalable, two-component video-to-speech system that combines spectrogram prediction with a neural vocoder, enabling application to large, diverse datasets.
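The two-component design can be pictured as a simple composition of a predictor and a vocoder. The sketch below is only an illustration of that structure, not the authors' implementation: the class name `VideoToSpeech`, the stand-in functions, and every numeric setting (80 mel bins, 4 mel frames per video frame, a hop of 160 samples, 25 video frames) are assumptions chosen to make the shapes easy to follow.

```python
from dataclasses import dataclass
from typing import Callable, List

Frame = List[float]      # one flattened video frame (placeholder representation)
MelFrame = List[float]   # one mel-spectrogram frame
Waveform = List[float]   # raw audio samples

# Illustrative settings only; none of these values come from the paper summary.
N_MELS = 80
MEL_FRAMES_PER_VIDEO_FRAME = 4
HOP_LENGTH = 160


@dataclass
class VideoToSpeech:
    """Chains a video-to-spectrogram predictor with a (pre-trained) vocoder."""
    predictor: Callable[[List[Frame]], List[MelFrame]]
    vocoder: Callable[[List[MelFrame]], Waveform]

    def __call__(self, video: List[Frame]) -> Waveform:
        mel = self.predictor(video)   # stage 1: silent video -> mel spectrogram
        return self.vocoder(mel)      # stage 2: mel spectrogram -> waveform


def dummy_predictor(video: List[Frame]) -> List[MelFrame]:
    # Hypothetical stand-in: emits silent mel frames at a fixed upsampling ratio.
    n_mel_frames = len(video) * MEL_FRAMES_PER_VIDEO_FRAME
    return [[0.0] * N_MELS for _ in range(n_mel_frames)]


def dummy_vocoder(mel: List[MelFrame]) -> Waveform:
    # Hypothetical stand-in: emits HOP_LENGTH silent samples per mel frame.
    return [0.0] * (len(mel) * HOP_LENGTH)


model = VideoToSpeech(predictor=dummy_predictor, vocoder=dummy_vocoder)
video = [[0.0] * 16 for _ in range(25)]   # 25 placeholder frames
audio = model(video)
```

Because the two stages only meet at the mel-spectrogram interface, either stand-in could be swapped for a trained model without changing the other, which is the property the contribution statement emphasizes.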
Findings
State-of-the-art results on GRID dataset
Considerably outperforms previous approaches on LRW dataset
First to produce intelligible speech on LRS3 dataset
Abstract
Video-to-speech synthesis (also known as lip-to-speech) refers to the translation of silent lip movements into the corresponding audio. This task has received an increasing amount of attention due to its self-supervised nature (i.e., it can be trained without manual labelling) combined with the ever-growing collection of audio-visual data available online. Despite these strong motivations, contemporary video-to-speech works focus mainly on small- to medium-sized corpora with substantial constraints in both vocabulary and setting. In this work, we introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder, which converts the mel-frequency spectrograms into waveform audio. We achieve state-of-the-art results for GRID and considerably outperform previous approaches on LRW. More importantly, by focusing on…
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Music and Audio Processing
