FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis

Yongqi Wang; Zhou Zhao

arXiv:2207.03800·cs.SD·July 14, 2022


TL;DR

FastLTS is a non-autoregressive, end-to-end lip-to-speech synthesis model that generates speech waveforms directly from silent talking-face videos, improving both inference speed and audio quality over existing two-stage methods.

Contribution

The paper introduces FastLTS, a non-autoregressive, end-to-end model with a transformer-based visual frontend for unconstrained lip-to-speech synthesis, reducing latency and memory usage.
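As a rough illustration of why non-autoregressive decoding can reduce latency, the toy sketch below counts the sequential model calls needed by each decoding style. It is not the FastLTS implementation: `toy_step`, `toy_predict_all`, and the frame count are hypothetical stand-ins chosen only for this example.

```python
from typing import Callable, List, Tuple


def autoregressive_decode(
    n_frames: int, step: Callable[[List[float]], float]
) -> Tuple[List[float], int]:
    """Generate frames one at a time; each call sees all earlier outputs."""
    frames: List[float] = []
    sequential_calls = 0
    for _ in range(n_frames):
        frames.append(step(frames))  # must wait for the previous frame
        sequential_calls += 1
    return frames, sequential_calls


def non_autoregressive_decode(
    n_frames: int, predict_all: Callable[[int], List[float]]
) -> Tuple[List[float], int]:
    """Generate every frame in one call, with no dependence between outputs."""
    return predict_all(n_frames), 1


# Hypothetical stand-ins for real networks (assumptions, for illustration only).
def toy_step(previous: List[float]) -> float:
    return 0.5 * previous[-1] + 1.0 if previous else 1.0


def toy_predict_all(n_frames: int) -> List[float]:
    return [1.0] * n_frames


if __name__ == "__main__":
    n = 240  # arbitrary number of output frames, not a figure from the paper
    _, ar_calls = autoregressive_decode(n, toy_step)
    _, nar_calls = non_autoregressive_decode(n, toy_predict_all)
    print(f"autoregressive: {ar_calls} sequential calls")
    print(f"non-autoregressive: {nar_calls} sequential call")
```

The call count is only a proxy for latency: the wall-clock speedup of a real model also depends on hardware parallelism, model size, and output length.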

Findings

01

Achieves a 19.76x speedup in audio waveform generation over the current autoregressive model on 3-second input sequences.

02

Produces superior speech quality in unconstrained lip-to-speech tasks.

03

First to use a transformer-based visual frontend for this task.

Abstract

Unconstrained lip-to-speech synthesis aims to generate the corresponding speech from silent videos of talking faces, with no restriction on head pose or vocabulary. Current works mainly use sequence-to-sequence models to solve this problem, either in an autoregressive architecture or a flow-based non-autoregressive architecture. However, these models suffer from several drawbacks: 1) Instead of directly generating audio, they use a two-stage pipeline that first generates mel-spectrograms and then reconstructs audio from the spectrograms. This makes deployment cumbersome and degrades speech quality through error propagation; 2) The audio reconstruction algorithm used by these models limits inference speed and audio quality, while neural vocoders are not available for these models because their output spectrograms are not accurate enough; 3) The autoregressive model suffers from…


Code & Models

Repositories

cyanbx/FastLTS (PyTorch, official)
