Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, Abdelrahman Mohamed

TL;DR
This paper introduces AV-HuBERT, a self-supervised framework that learns audio-visual speech representations by predicting masked multimodal hidden units, significantly improving lip-reading and speech recognition performance with less labeled data.
Contribution
The paper presents AV-HuBERT, a self-supervised learning method that masks multi-stream (audio and video) input and predicts automatically discovered, iteratively refined multimodal hidden units, yielding representations that benefit both lip-reading and automatic speech recognition.
Findings
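Achieves 32.5% lip-reading WER on LRS3 with only 30 hours of labeled data.
Outperforms the former state of the art (33.6% WER), which was trained on 31K hours of transcribed video, about a thousand times more labeled data.
Reduces lip-reading WER to 26.9% with the full 433-hour labeled dataset and self-training.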
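The findings above are reported as word error rate (WER). As a reminder of what that number measures, here is a minimal sketch of the standard definition: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration; the paper's actual scoring pipeline (text normalization, tooling) is not described in this summary.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Illustrative only: tokenizes on whitespace and applies no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missing)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under this definition, a WER of 32.5% means roughly one word error for every three reference words.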
Abstract
Video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker's lip movements and the produced sound. We introduce Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. AV-HuBERT learns powerful audio-visual speech representation benefiting both lip-reading and automatic speech recognition. On the largest public lip-reading benchmark LRS3 (433 hours), AV-HuBERT achieves 32.5% WER with only 30 hours of labeled data, outperforming the former state-of-the-art approach (33.6%) trained with a thousand times more transcribed video data (31K hours). The lip-reading WER is further reduced to 26.9% when using all 433…
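The abstract describes the training objective only at a high level: mask parts of the input and predict discrete hidden units (cluster IDs) for the frames. The sketch below illustrates that general idea in plain Python. Everything specific in it is an assumption made for illustration, not the authors' implementation: the helper names `sample_span_mask` and `masked_cluster_loss`, the span-based masking, and the choice to average cross-entropy over masked frames only.

```python
import math
import random


def sample_span_mask(num_frames, span_len, num_spans, rng):
    """Mark frames covered by randomly placed spans (spans may overlap).

    Hypothetical masking scheme; the span length and count are free parameters.
    """
    mask = [False] * num_frames
    for _ in range(num_spans):
        start = rng.randrange(0, num_frames - span_len + 1)
        for t in range(start, start + span_len):
            mask[t] = True
    return mask


def masked_cluster_loss(logits, targets, mask):
    """Mean cross-entropy between predicted cluster scores and target cluster IDs,
    computed over masked frames only (an assumption of this sketch).

    logits:  per-frame lists of unnormalized scores, one score per cluster
    targets: per-frame integer cluster IDs (the "hidden units")
    mask:    per-frame booleans, True where the input was masked
    """
    losses = []
    for frame_logits, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(frame_logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in frame_logits))
        losses.append(log_z - frame_logits[target])
    if not losses:
        raise ValueError("mask selects no frames")
    return sum(losses) / len(losses)


if __name__ == "__main__":
    rng = random.Random(0)
    mask = sample_span_mask(num_frames=10, span_len=3, num_spans=2, rng=rng)
    uniform_logits = [[0.0] * 4 for _ in range(10)]  # 4 hypothetical clusters
    cluster_ids = [0] * 10
    print(mask)
    print(masked_cluster_loss(uniform_logits, cluster_ids, mask))
```

With uniform scores over K clusters the loss equals log K, which is a convenient sanity check. In the framework described by the abstract, the target cluster IDs are themselves discovered automatically and refined across iterations; that step is outside the scope of this sketch.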
Code & Models
- 🤗 vumichien/AV-HuBERT (model) · ♡ 13
- 🤗 enactic/avista-large-plus-v2 (model) · 2 dl · ♡ 2
- 🤗 enactic/avista-large-v2 (model)
- 🤗 enactic/avista-base-v2 (model)
- 🤗 enactic/avista-base-plus-v2 (model) · 2 dl · ♡ 1
- 🤗 enactic/avista-base-plus (model) · 2 dl
- 🤗 enactic/avista-base (model) · 1 dl · ♡ 1
- 🤗 enactic/avista-large (model) · 3 dl · ♡ 1
- 🤗 enactic/avista-large-plus (model)
- 🤗 enactic/japanese-avhubert-base-iter1 (model)
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Video Analysis and Summarization
Methods: Attention Is All You Need · Linear Layer · Residual Connection · Softmax · Attention Dropout · Dropout · Layer Normalization · WordPiece · Multi-Head Attention
