SLD-L2S: Hierarchical Subspace Latent Diffusion for High-Fidelity Lip to Speech Synthesis

Yifan Liang; Andong Li; Kang Yang; Guochen Yu; Fangkun Liu; Lingling Dai; Xiaodong Li; Chengshi Zheng

arXiv:2602.11477 · eess.AS · February 13, 2026

TL;DR

SLD-L2S introduces a hierarchical subspace latent diffusion framework for lip-to-speech synthesis that maps lip movements directly to the continuous latent space of a pre-trained neural audio codec, bypassing intermediate representations such as mel-spectrograms or discrete SSL tokens, and reports state-of-the-art results.

Contribution

The paper presents a novel hierarchical subspace latent diffusion model that directly generates speech from lip movements, avoiding information loss from intermediate representations.

Findings

01. Achieves state-of-the-art synthesis quality on benchmarks.

02. Outperforms existing methods in objective evaluations.

03. Improves speech naturalness and intelligibility.

Abstract

Although lip-to-speech synthesis (L2S) has achieved significant progress in recent years, current state-of-the-art methods typically rely on intermediate representations such as mel-spectrograms or discrete self-supervised learning (SSL) tokens. The potential of latent diffusion models (LDMs) in this task remains largely unexplored. In this paper, we introduce SLD-L2S, a novel L2S framework built upon a hierarchical subspace latent diffusion model. Our method aims to directly map visual lip movements to the continuous latent space of a pre-trained neural audio codec, thereby avoiding the information loss inherent in traditional intermediate representations. The core of our method is a hierarchical architecture that processes visual representations through multiple parallel subspaces, initiated by a subspace decomposition module. To efficiently enhance interactions within and between…
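
As a rough illustration of what "processing visual representations through multiple parallel subspaces" could look like, the sketch below splits a sequence of visual features into several lower-dimensional subspaces using independent linear projections. This is only one possible reading of the abstract, not the authors' subspace decomposition module: the function name `decompose_into_subspaces`, the random projection matrices, and the sizes `T`, `D`, and `K` are all hypothetical choices made for the example.

```python
import numpy as np


def decompose_into_subspaces(features, projections):
    """Project (T, D) features into K parallel subspaces.

    Hypothetical helper for illustration only; returns an array of
    shape (K, T, d), one (T, d) slice per projection matrix.
    """
    return np.stack([features @ w for w in projections])


rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from the paper):
# T video frames, D-dimensional visual features, K subspaces.
T, D, K = 50, 256, 4
d = D // K  # dimension of each subspace

# Stand-in for lip-movement features from a visual encoder.
visual_features = rng.standard_normal((T, D))

# One (D, d) projection per subspace; random here, learned in a real model.
projections = [rng.standard_normal((D, d)) / np.sqrt(D) for _ in range(K)]

subspaces = decompose_into_subspaces(visual_features, projections)
print(subspaces.shape)  # (4, 50, 64)
```

Each of the `K` slices could then be processed by its own branch before the branches exchange information; how SLD-L2S actually handles those interactions is not recoverable from the truncated abstract.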

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Hearing Loss and Rehabilitation