Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT

Bowen Shi; Abdelrahman Mohamed; Wei-Ning Hsu

arXiv:2205.07180·eess.AS·July 18, 2022·1 cites

TL;DR

This paper explores self-supervised pre-training of audio-visual speaker embeddings using AV-HuBERT, demonstrating significant improvements in speaker verification accuracy and noise robustness by leveraging lip-based visual information.

Contribution

It introduces the application of AV-HuBERT for lip-based audio-visual speaker embedding pre-training, enhancing performance and noise robustness in speaker verification tasks.

Findings

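01 AV-HuBERT pre-training improves label efficiency for speaker verification roughly tenfold, for both audio-only and audio-visual models.

02 Adding the lip-area visual stream reduces equal error rate (EER) by 38% in the clean condition and 75% in noisy conditions.

03 Visual input limited to the lip area is enough to substantially improve noise robustness.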

Abstract

This paper investigates self-supervised pre-training for audio-visual speaker representation learning, where a visual stream showing the speaker's mouth area is used alongside speech as input. Our study focuses on the Audio-Visual Hidden Unit BERT (AV-HuBERT) approach, a recently developed general-purpose audio-visual speech pre-training framework. We conducted extensive experiments probing the effectiveness of pre-training and of the visual modality. Experimental results suggest that AV-HuBERT generalizes decently to speaker-related downstream tasks, improving label efficiency roughly tenfold for both audio-only and audio-visual speaker verification. We also show that incorporating visual information, even just the lip area, greatly improves performance and noise robustness, reducing EER by 38% in the clean condition and 75% in noisy conditions.
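For readers unfamiliar with the metric, the following is a minimal sketch of how an equal error rate (EER) can be estimated from verification trial scores, and how a relative reduction such as the 38% or 75% quoted above would be computed, assuming those percentages are relative reductions. It is not the paper's evaluation code: the function names `equal_error_rate` and `relative_reduction` and all numbers in the usage lines are hypothetical, chosen only for illustration.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate EER by sweeping thresholds over the observed scores.

    scores: similarity score per trial (higher means "same speaker").
    labels: 1/True for target (same-speaker) trials, 0/False for impostor trials.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_gap, eer = np.inf, 1.0
    for threshold in np.unique(scores):
        accept = scores >= threshold
        far = np.mean(accept[~labels])   # impostor trials wrongly accepted
        frr = np.mean(~accept[labels])   # target trials wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Toy trial list (made-up values, not taken from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)

# Hypothetical baseline and improved EERs, only to show the arithmetic.
toy_reduction = relative_reduction(0.04, 0.01)
```

With this definition, a drop from a hypothetical 4% EER to 1% EER counts as a 75% relative reduction, even though the absolute difference is only three percentage points.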

Code & Models

Repositories

facebookresearch/av_hubert (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Music and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Weight Decay · Softmax · Attention Dropout · Layer Normalization · Dropout · Dense Connections · Adam