SVTS: Scalable Video-to-Speech Synthesis

Rodrigo Mira; Alexandros Haliassos; Stavros Petridis; Björn W. Schuller; Maja Pantic

arXiv:2205.02058 · cs.SD · August 17, 2022 · 1 citation

TL;DR

This paper presents a scalable video-to-speech synthesis framework that effectively converts silent lip movements into speech, achieving state-of-the-art results on multiple datasets and demonstrating scalability to large, unconstrained data.

Contribution

Introduces a scalable, two-component video-to-speech system that combines spectrogram prediction with a neural vocoder, enabling application to large, diverse datasets.
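The two-component split described above can be sketched as a composition of two callables. This is only an illustrative sketch, not the authors' implementation: `placeholder_predictor` and `placeholder_vocoder` are hypothetical stand-ins for trained networks, and the constants `N_MELS`, `MEL_PER_VIDEO`, and `HOP_SAMPLES` are assumed values chosen to make the shapes easy to follow, not figures taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List

# All sizes below are illustrative assumptions, not values from the paper.
N_MELS = 80          # mel bins per spectrogram frame
MEL_PER_VIDEO = 4    # spectrogram frames emitted per video frame
HOP_SAMPLES = 160    # waveform samples generated per spectrogram frame

Frame = List[float]        # hypothetical flattened mouth-region crop
Mel = List[List[float]]    # spectrogram as a list of N_MELS-sized frames
Wave = List[float]         # mono waveform samples


def placeholder_predictor(frames: List[Frame]) -> Mel:
    """Hypothetical stand-in for a trained video-to-spectrogram network."""
    mel: Mel = []
    for frame in frames:
        level = sum(frame) / len(frame)  # crude per-frame summary value
        mel.extend([[level] * N_MELS for _ in range(MEL_PER_VIDEO)])
    return mel


def placeholder_vocoder(mel: Mel) -> Wave:
    """Hypothetical stand-in for a pre-trained neural vocoder."""
    wave: Wave = []
    for mel_frame in mel:
        amplitude = sum(mel_frame) / len(mel_frame)
        wave.extend([amplitude] * HOP_SAMPLES)
    return wave


@dataclass
class VideoToSpeech:
    predictor: Callable[[List[Frame]], Mel]
    vocoder: Callable[[Mel], Wave]

    def synthesize(self, frames: List[Frame]) -> Wave:
        mel = self.predictor(frames)  # stage 1: silent video -> mel spectrogram
        return self.vocoder(mel)      # stage 2: mel spectrogram -> waveform


pipeline = VideoToSpeech(placeholder_predictor, placeholder_vocoder)
silent_clip = [[0.5] * 16 for _ in range(25)]  # 25 dummy video frames
audio = pipeline.synthesize(silent_clip)
```

Because the vocoder is passed in as a separate component, a pre-trained one could in principle be reused while only the spectrogram predictor is trained on new video data, which is one plausible reading of why the split helps with scaling.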

Findings

01. State-of-the-art results on GRID dataset

02. Significant improvement on LRW dataset

03. First to produce intelligible speech on LRS3 dataset

Abstract

Video-to-speech synthesis (also known as lip-to-speech) refers to the translation of silent lip movements into the corresponding audio. This task has received an increasing amount of attention due to its self-supervised nature (i.e., it can be trained without manual labelling) combined with the ever-growing collection of audio-visual data available online. Despite these strong motivations, contemporary video-to-speech works focus mainly on small- to medium-sized corpora with substantial constraints in both vocabulary and setting. In this work, we introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder, which converts the mel-frequency spectrograms into waveform audio. We achieve state-of-the-art results for GRID and considerably outperform previous approaches on LRW. More importantly, by focusing on…

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Music and Audio Processing