RESOUND: Speech Reconstruction from Silent Videos via Acoustic-Semantic Decomposed Modeling

Long-Khanh Pham; Thanh V. T. Tran; Minh-Tan Pham; Van Nguyen

arXiv:2505.22024 · cs.SD · May 29, 2025

Open Access

TL;DR

RESOUND is a lip-to-speech system that reconstructs natural, expressive speech from silent talking-face videos by modeling linguistic content and prosody in separate paths and feeding speech units, alongside mel-spectrograms, into waveform generation.

Contribution

The paper introduces a source-filter-inspired model with separate semantic and acoustic paths, and incorporates speech units into waveform generation to improve speech reconstruction from silent videos.
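For readers unfamiliar with the source-filter idea that the contribution refers to, the following is a minimal sketch of the classical theory only, not of RESOUND's architecture. It assumes a toy synthesizer in which a pulse train (the "source", carrying pitch) is shaped by two-pole resonators (the "filter", carrying vowel-like content). The function names, formant values, and sample rate are illustrative assumptions and do not come from the paper. The analogy to the paper's acoustic path (prosody) and semantic path (linguistic content) is loose.

```python
import math

def pulse_train(f0_hz, sample_rate, n_samples):
    """Source: unit impulses spaced one pitch period apart (carries pitch)."""
    period = round(sample_rate / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Filter: stable two-pole resonator approximating a single formant."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, < 1
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0  # previous two output samples
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize(f0_hz, formants, sample_rate=16000, duration_s=0.05):
    """Pass the source through one resonator per (frequency, bandwidth) pair."""
    n = int(sample_rate * duration_s)
    signal = pulse_train(f0_hz, sample_rate, n)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal

# Hypothetical formant values, roughly in the range of an open vowel.
VOWEL_FORMANTS = [(700.0, 80.0), (1200.0, 90.0)]

# Same "content" (filter), two different pitches (source).
low = synthesize(100.0, VOWEL_FORMANTS)
high = synthesize(200.0, VOWEL_FORMANTS)
```

Changing `f0_hz` alters only the excitation, while changing `VOWEL_FORMANTS` alters only the spectral shaping. That independence is the property a decomposed model can exploit by learning each factor separately.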

Findings

01. Effective across multiple benchmarks

02. Improves speech naturalness and intelligibility

03. Preserves speaker identity

Abstract

Lip-to-speech (L2S) synthesis, which reconstructs speech from visual cues, faces challenges in accuracy and naturalness due to limited supervision in capturing linguistic content, accents, and prosody. In this paper, we propose RESOUND, a novel L2S system that generates intelligible and expressive speech from silent talking face videos. Leveraging source-filter theory, our method involves two components: an acoustic path to predict prosody and a semantic path to extract linguistic features. This separation simplifies learning, allowing independent optimization of each representation. Additionally, we enhance performance by integrating speech units, a proven unsupervised speech representation technique, into waveform generation alongside mel-spectrograms. This allows RESOUND to synthesize prosodic speech while preserving content and speaker identity. Experiments conducted on two standard…

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Generative Adversarial Networks and Image Synthesis