Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT
Bowen Shi, Abdelrahman Mohamed, Wei-Ning Hsu

TL;DR
This paper explores self-supervised pre-training of audio-visual speaker embeddings using AV-HuBERT, demonstrating significant improvements in speaker verification accuracy and noise robustness by leveraging lip-based visual information.
Contribution
It introduces the application of AV-HuBERT for lip-based audio-visual speaker embedding pre-training, enhancing performance and noise robustness in speaker verification tasks.
Findings
AV-HuBERT pre-training improves label efficiency roughly tenfold for both audio-only and audio-visual speaker verification.
Incorporating lip visual data reduces the equal error rate (EER) by 38% in the clean condition and 75% in noisy conditions.
Lip-based visual information significantly enhances noise robustness.
Abstract
This paper investigates self-supervised pre-training for audio-visual speaker representation learning, where a visual stream showing the speaker's mouth area is used alongside speech as input. Our study focuses on the Audio-Visual Hidden Unit BERT (AV-HuBERT) approach, a recently developed general-purpose audio-visual speech pre-training framework. We conducted extensive experiments probing the effectiveness of pre-training and of the visual modality. Experimental results suggest that AV-HuBERT generalizes decently to speaker-related downstream tasks, improving label efficiency roughly tenfold for both audio-only and audio-visual speaker verification. We also show that incorporating visual information, even just the lip area, greatly improves performance and noise robustness, reducing EER by 38% in the clean condition and 75% in noisy conditions.
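The results above are reported as equal error rate (EER), the operating point at which the false-acceptance and false-rejection rates of a verification system coincide. The following is a minimal sketch of how EER is commonly computed from trial scores, not the authors' evaluation code. The function names `equal_error_rate` and `relative_reduction`, the toy scores, and the reading of the 38% and 75% figures as relative reductions are all assumptions made for illustration.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate the EER of a set of verification trials.

    scores: similarity per trial (higher = more likely the same speaker).
    labels: 1 for same-speaker (target) trials, 0 for different-speaker trials.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    targets = scores[labels == 1]
    nontargets = scores[labels == 0]

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that "reject every trial" is also considered.
    thresholds = np.unique(scores)
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)

    # A trial is accepted when its score is >= the threshold.
    far = np.array([(nontargets >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(targets < t).mean() for t in thresholds])      # false rejects

    # Take the threshold where the two error rates are closest and average them.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0


def relative_reduction(baseline, improved):
    """Relative reduction of an error rate, e.g. 0.38 for a 38% reduction."""
    return (baseline - improved) / baseline


# Toy trials (hypothetical values, not data from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

On these toy trials the two error rates meet at a threshold of 0.6, where one of four different-speaker trials is accepted and one of four same-speaker trials is rejected, giving an EER of 0.25. Under the relative-reduction assumption, a hypothetical baseline EER of 4.0% falling to 2.48% would correspond to the 38% figure.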
Taxonomy
Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Music and Audio Processing
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Weight Decay · Softmax · Attention Dropout · Layer Normalization · Dropout · Dense Connections · Adam
