LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading
Leyuan Qu, Cornelius Weber, Stefan Wermter

TL;DR
This paper introduces LipSound2, a self-supervised model that reconstructs speech from lip videos across multiple languages, significantly improving speech quality and lip reading accuracy without requiring annotated data.
Contribution
LipSound2 is a novel self-supervised framework that leverages audio-visual data for cross-lingual lip-to-speech reconstruction and lip reading, outperforming previous methods.
Findings
Significant improvement in speech quality and intelligibility in English.
Successful transferability to Chinese speech reconstruction.
State-of-the-art lip reading performance on benchmark datasets.
Abstract
The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos. We propose LipSound2, which consists of an encoder-decoder architecture and a location-aware attention mechanism to map face image sequences directly to mel-scale spectrograms without requiring any human annotations. The proposed LipSound2 model is first pre-trained on 2400 h of multilingual (e.g. English and German) audio-visual data (VoxCeleb2). To verify the generalizability of the proposed method, we then fine-tune the pre-trained model on domain-specific datasets (GRID, TCD-TIMIT) for English speech reconstruction and achieve a significant improvement in speech quality and intelligibility compared to previous approaches in speaker-dependent and speaker-independent settings.…
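The abstract names an encoder-decoder with location-aware attention but gives no implementation details on this page. As a rough illustration only, the NumPy sketch below shows one decoder step of location-aware attention in its commonly used additive form, where the previous step's attention weights are convolved into location features before scoring each video frame. The function names, parameter names, shapes, and dimensions are hypothetical and are not taken from the LipSound2 code.

```python
import numpy as np


def init_params(rng, d_q, d_m, d_a, n_filters, kernel):
    """Random parameters for the sketch (all names and sizes are assumptions)."""
    return {
        "W": rng.standard_normal((d_q, d_a)) * 0.1,        # projects the decoder state
        "V": rng.standard_normal((d_m, d_a)) * 0.1,        # projects encoder outputs
        "U": rng.standard_normal((n_filters, d_a)) * 0.1,  # projects location features
        "F": rng.standard_normal((n_filters, kernel)) * 0.1,  # 1-D location filters
        "w": rng.standard_normal(d_a) * 0.1,               # scoring vector
        "b": np.zeros(d_a),                                # bias
    }


def location_aware_attention(query, memory, prev_align, params):
    """One decoder step of additive attention with location features.

    query:      (d_q,)   previous decoder state
    memory:     (T, d_m) encoder outputs, one row per video frame
    prev_align: (T,)     attention weights from the previous decoder step
    Returns the context vector (d_m,) and the new attention weights (T,).
    """
    W, V, U, F, w, b = (params[k] for k in ("W", "V", "U", "F", "w", "b"))

    # Convolve the previous alignment with each filter; mode="same" keeps
    # length T as long as the kernel is no longer than T.
    loc = np.stack(
        [np.convolve(prev_align, f, mode="same") for f in F], axis=1
    )  # (T, n_filters)

    # Additive score per frame, combining content and location terms.
    energies = np.tanh(query @ W + memory @ V + loc @ U + b) @ w  # (T,)

    # Numerically stable softmax over frames.
    energies = energies - energies.max()
    align = np.exp(energies) / np.exp(energies).sum()

    # Context vector: attention-weighted sum of encoder outputs.
    context = align @ memory
    return context, align
```

In a full decoder, the context vector would be fed together with the previous output frame into a recurrent layer that predicts the next mel-spectrogram frame, and the returned weights would be passed back in as `prev_align` on the following step.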
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Facial Nerve Paralysis Treatment and Research
