Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping
Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Haithem Boussaid, Ebtessam Almazrouei, Merouane Debbah

TL;DR
Lip2Vec introduces a simple, prior-based visual speech recognition method that maps lip video features to audio representations, enabling effective speech decoding with less reliance on extensive labeled data and improving out-of-distribution performance.
Contribution
The paper presents Lip2Vec, a novel approach that learns a latent-to-latent mapping from visual to audio representations, reducing complexity and reliance on labeled data in VSR.
Findings
Achieves a word error rate (WER) of 26 on the LRS3 dataset.
Maintains reasonable performance on the VoxCeleb test set.
Outperforms fully-supervised methods in certain scenarios.
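For readers unfamiliar with the metric in the findings above, here is a minimal sketch of how WER is conventionally computed: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. It assumes the reported figure is a percentage, as is usual for this metric; the function name `word_error_rate` and the whitespace tokenization are illustrative choices, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage, using word-level Levenshtein distance.

    Illustrative helper (not from the paper); tokenizes on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)
```

Under this definition, one substituted word in a four-word reference gives a WER of 25, so a WER of 26 corresponds to roughly one word error for every four reference words.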
Abstract
Visual Speech Recognition (VSR) differs from the common perception tasks as it requires deeper reasoning over the video sequence, even by human experts. Despite the recent advances in VSR, current approaches rely on labeled data to fully train or finetune their models predicting the target speech. This hinders their ability to generalize well beyond the training set and leads to performance degeneration under out-of-distribution challenging scenarios. Unlike previous works that involve auxiliary losses or complex training procedures and architectures, we propose a simple approach, named Lip2Vec that is based on learning a prior model. Given a robust visual speech encoder, this network maps the encoded latent representations of the lip sequence to their corresponding latents from the audio pair, which are sufficiently invariant for effective text decoding. The generated audio…
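The latent-to-latent idea described in the abstract can be illustrated with a toy sketch: given paired visual and audio latents, fit a mapping from the former to the latter by minimizing a squared-error loss. Everything here is an assumption made for illustration. The abstract does not specify the prior network's architecture or loss, so a plain linear map trained by gradient descent is swapped in, and the names `make_toy_pairs`, `apply_prior`, `train_linear_prior`, and `mse` are hypothetical, as is the synthetic data.

```python
import random


def make_toy_pairs(n=40, seed=1):
    """Hypothetical stand-in for paired (visual latent, audio latent) data.

    Audio latents are generated from visual latents by a fixed linear map,
    so a perfect mapping exists; real encoder outputs would not be this clean.
    """
    rng = random.Random(seed)
    w_true = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.75]]
    b_true = [0.1, -0.2]
    visual = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(n)]
    audio = [apply_prior(w_true, b_true, v) for v in visual]
    return visual, audio


def apply_prior(weights, bias, visual_latent):
    """Map one visual latent vector to a predicted audio latent vector."""
    return [
        sum(w * x for w, x in zip(row, visual_latent)) + b
        for row, b in zip(weights, bias)
    ]


def mse(weights, bias, visual, audio):
    """Mean over samples of the summed squared error per latent vector."""
    total = 0.0
    for v, a in zip(visual, audio):
        pred = apply_prior(weights, bias, v)
        total += sum((p - t) ** 2 for p, t in zip(pred, a))
    return total / len(visual)


def train_linear_prior(visual, audio, lr=0.1, epochs=500, seed=0):
    """Fit a linear visual-to-audio map by full-batch gradient descent."""
    rng = random.Random(seed)
    d_in, d_out, n = len(visual[0]), len(audio[0]), len(visual)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    bias = [0.0] * d_out

    for _ in range(epochs):
        grad_w = [[0.0] * d_in for _ in range(d_out)]
        grad_b = [0.0] * d_out
        for v, a in zip(visual, audio):
            pred = apply_prior(weights, bias, v)
            for o in range(d_out):
                err = pred[o] - a[o]
                grad_b[o] += 2.0 * err / n
                for i in range(d_in):
                    grad_w[o][i] += 2.0 * err * v[i] / n
        # Gradient step on the same loss that mse() reports.
        for o in range(d_out):
            bias[o] -= lr * grad_b[o]
            for i in range(d_in):
                weights[o][i] -= lr * grad_w[o][i]

    return weights, bias
```

In this toy setup, calling `train_linear_prior(*make_toy_pairs())` returns weights whose predictions closely match the synthetic audio latents. In the setting the abstract describes, the predicted audio latents would then be passed to a text decoder; that stage is not sketched here.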
Taxonomy
Topics: Speech and Audio Processing · Indoor and Outdoor Localization Technologies · Face recognition and analysis
