NaturalL2S: End-to-End High-quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing

Yifan Liang; Fangkun Liu; Andong Li; Xiaodong Li; Chengshi Zheng

arXiv:2502.12002·cs.SD·February 18, 2025

Open Access

TL;DR

NaturalL2S is an end-to-end lip-to-speech synthesis framework that integrates acoustic biases and differentiable signal processing to improve speech quality without relying on mel-spectrogram intermediates.

Contribution

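NaturalL2S introduces an end-to-end approach that combines fundamental frequency (F0) prediction with differentiable digital signal processing (DDSP) for high-quality lip-to-speech synthesis, reducing the domain gap that arises when vocoders trained on real mel-spectrograms receive synthetic ones.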
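To make the F0-plus-DDSP idea concrete, here is a minimal sketch of a harmonic oscillator, the kind of signal-processing building block that DDSP-style synthesizers drive with a predicted F0 contour. It is not the authors' implementation: the function name `harmonic_synth`, its parameters, and the 16 kHz sample rate are illustrative assumptions, and a trainable version would be written in an autodiff framework rather than NumPy.

```python
import numpy as np

def harmonic_synth(f0_hz, amplitudes, sample_rate=16000):
    """Sum-of-sinusoids synthesis from a per-sample F0 contour (illustrative only).

    f0_hz:      shape (T,), fundamental frequency in Hz; 0 marks unvoiced samples.
    amplitudes: shape (T, K), per-sample amplitude of each of K harmonics.
    Returns a waveform of shape (T,).
    """
    f0_hz = np.asarray(f0_hz, dtype=np.float64)
    amplitudes = np.asarray(amplitudes, dtype=np.float64)
    harmonic_numbers = np.arange(1, amplitudes.shape[1] + 1)

    # Instantaneous phase of the fundamental: running sum of angular frequency.
    phase = 2.0 * np.pi * np.cumsum(f0_hz) / sample_rate

    # Silence harmonics at or above the Nyquist frequency and all unvoiced samples.
    harmonic_freqs = f0_hz[:, None] * harmonic_numbers[None, :]
    audible = (harmonic_freqs < sample_rate / 2) & (f0_hz[:, None] > 0)

    # Harmonic k oscillates at k times the fundamental phase.
    sinusoids = np.sin(phase[:, None] * harmonic_numbers[None, :])
    return np.sum(amplitudes * audible * sinusoids, axis=1)

# Hypothetical usage: one second of a steady 200 Hz tone with four harmonics
# whose amplitudes decay as 1/k.
f0 = np.full(16000, 200.0)
amps = np.tile(1.0 / np.arange(1, 5), (16000, 1))
audio = harmonic_synth(f0, amps)
```

Because every step is built from sums, products, and sines, the same computation expressed in an autodiff framework lets a waveform-level loss propagate gradients back to the F0 and amplitude predictors, which is the general property that makes DDSP components usable inside an end-to-end model.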

Findings

01. Outperforms state-of-the-art methods in speech quality

02. Effectively captures prosody without explicit speaker modeling

03. Enhances synthesis intelligibility and naturalness

Abstract

Recent advancements in visual speech recognition (VSR) have promoted progress in lip-to-speech synthesis, where pre-trained VSR models enhance the intelligibility of synthesized speech by providing valuable semantic information. The success achieved by cascade frameworks, which combine pseudo-VSR with pseudo-text-to-speech (TTS) or implicitly utilize the transcribed text, highlights the benefits of leveraging VSR models. However, these methods typically rely on mel-spectrograms as an intermediate representation, which may introduce a key bottleneck: the domain gap between synthetic mel-spectrograms, generated from inherently error-prone lip-to-speech mappings, and real mel-spectrograms used to train vocoders. This mismatch inevitably degrades synthesis quality. To bridge this gap, we propose Natural Lip-to-Speech (NaturalL2S), an end-to-end framework integrating acoustic inductive…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis