TL;DR
FastLTS is a non-autoregressive, end-to-end lip-to-speech synthesis model that generates speech waveforms directly from silent talking-face videos, improving both inference speed and audio quality over existing two-stage methods.
Contribution
The paper introduces FastLTS, a non-autoregressive, end-to-end model with a transformer-based visual frontend for unconstrained lip-to-speech synthesis, reducing latency and memory usage.
Findings
Generates waveforms 19.76x faster than the autoregressive baseline.
Produces higher speech quality than existing methods on unconstrained lip-to-speech tasks.
Presented as the first model to use a transformer-based visual frontend for this task.
Abstract
Unconstrained lip-to-speech synthesis aims to generate corresponding speeches from silent videos of talking faces with no restriction on head poses or vocabulary. Current works mainly use sequence-to-sequence models to solve this problem, either in an autoregressive architecture or a flow-based non-autoregressive architecture. However, these models suffer from several drawbacks: 1) Instead of directly generating audios, they use a two-stage pipeline that first generates mel-spectrograms and then reconstructs audios from the spectrograms. This causes cumbersome deployment and degradation of speech quality due to error propagation; 2) The audio reconstruction algorithm used by these models limits the inference speed and audio quality, while neural vocoders are not available for these models since their output spectrograms are not accurate enough; 3) The autoregressive model suffers from…
