Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

Bowen Shi; Wei-Ning Hsu; Kushal Lakhotia; Abdelrahman Mohamed

arXiv:2201.02184·eess.AS·March 15, 2022·113 cites

TL;DR

This paper introduces AV-HuBERT, a self-supervised framework that learns audio-visual speech representations by predicting masked multimodal hidden units, significantly improving lip-reading and speech recognition performance with less labeled data.

Contribution

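The paper presents AV-HuBERT, a self-supervised learning method that masks multi-stream (audio and visual) video input and predicts automatically discovered, iteratively refined multimodal hidden units. The resulting representation benefits both lip-reading and automatic speech recognition.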

Findings

01

Achieves 32.5% lip-reading WER on LRS3 with only 30 hours of labeled data.

02

Outperforms the former state of the art (33.6% WER), which was trained on 31K hours of transcribed video, about a thousand times more labeled data.

03

Reduces lip-reading WER further to 26.9% when using all 433 hours of labeled LRS3 data together with self-training.
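
The findings above are all reported as word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch of the standard word-level edit-distance definition. The function name `word_error_rate` is hypothetical, whitespace tokenization is an assumption, and this is not the paper's scoring script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is a plain whitespace split (assumption).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution in a four-word reference gives a WER of 0.25 (25%).
print(word_error_rate("a b c d", "a x c d"))
```

Read this way, a WER of 32.5% means roughly one word-level error for every three reference words.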

Abstract

Video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker's lip movements and the produced sound. We introduce Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. AV-HuBERT learns powerful audio-visual speech representation benefiting both lip-reading and automatic speech recognition. On the largest public lip-reading benchmark LRS3 (433 hours), AV-HuBERT achieves 32.5% WER with only 30 hours of labeled data, outperforming the former state-of-the-art approach (33.6%) trained with a thousand times more transcribed video data (31K hours). The lip-reading WER is further reduced to 26.9% when using all 433…
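
The masked-prediction objective described in the abstract can be illustrated with a toy NumPy sketch. Everything here is an assumption for illustration, not the authors' implementation: the helper names (`assign_clusters`, `span_mask`, `masked_cluster_loss`), the nearest-centroid assignment standing in for unit discovery, the simple concatenation of audio and visual features, and the choice to average the loss over masked frames only.

```python
import numpy as np


def assign_clusters(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Return the index of the nearest centroid for each frame (hypothetical helper)."""
    # Squared Euclidean distances, shape (num_frames, num_clusters).
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def span_mask(num_frames: int, start: int, length: int) -> np.ndarray:
    """Boolean mask that is True on one contiguous span of frames."""
    mask = np.zeros(num_frames, dtype=bool)
    mask[start:start + length] = True
    return mask


def masked_cluster_loss(logits: np.ndarray, targets: np.ndarray, mask: np.ndarray) -> float:
    """Mean cross-entropy between logits and cluster targets over masked frames only."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return float(nll[mask].mean())


# Toy usage: 8 frames with random stand-ins for audio and visual features.
rng = np.random.default_rng(0)
audio = rng.normal(size=(8, 3))
video = rng.normal(size=(8, 2))
fused = np.concatenate([audio, video], axis=1)   # assumed fusion: concatenation
centroids = fused[[0, 4]]                        # two assumed cluster centres
targets = assign_clusters(fused, centroids)      # per-frame "hidden unit" labels
mask = span_mask(num_frames=8, start=2, length=3)
logits = np.zeros((8, 2))                        # placeholder for model outputs
print(masked_cluster_loss(logits, targets, mask))
```

In a real system the logits would come from a model that sees the masked input, and the cluster targets would be recomputed from learned features across iterations, which is the "iteratively refined" part of the abstract.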

Code & Models

Videos

Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction · slideslive

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Video Analysis and Summarization

Methods: Attention Is All You Need · Linear Layer · Residual Connection · Softmax · Attention Dropout · Dropout · Layer Normalization · WordPiece · Multi-Head Attention