Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification

Meng Liu; Kong Aik Lee; Longbiao Wang; Hanyi Zhang; Chang Zeng; Jianwu Dang

arXiv:2302.11254·cs.SD·February 23, 2023

TL;DR

This paper introduces a cross-modal co-learning framework that exploits audio-visual correlations through Transformer-based modules to improve text-independent speaker verification, reporting 60% and 20% average relative improvements over audio-only and baseline fusion systems, respectively.

Contribution

It proposes a novel cross-modal co-learning paradigm with Transformer-based modality alignment for speaker verification, utilizing synchronized audio-visual data.

Findings

01. Achieves a 60% average relative improvement over audio-only systems.

02. Achieves a 20% average relative improvement over baseline fusion systems.

03. Demonstrates effectiveness on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets.
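
As a rough illustration of what a "relative improvement" figure means, the sketch below computes the fractional reduction of an error metric against a baseline. It assumes a lower-is-better metric such as equal error rate; the listing above does not name the metric, and the function name and the numbers are hypothetical, chosen only so the arithmetic lands on 60%.

```python
def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Fractional reduction of a lower-is-better error metric vs. a baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical error rates (in percent), NOT results from the paper.
audio_only_error = 5.0
co_learned_error = 2.0

gain = relative_improvement(audio_only_error, co_learned_error)
print(f"{gain:.0%}")  # prints "60%"
```

Under this reading, a 60% relative improvement means the error shrinks to 40% of the baseline value, not that it drops by 60 absolute points.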

Abstract

Visual speech (i.e., lip motion) is highly related to auditory speech due to the co-occurrence and synchronization in speech production. This paper investigates this correlation and proposes a cross-modal speech co-learning paradigm. The primary motivation of our cross-modal co-learning method is modeling one modality aided by exploiting knowledge from another modality. Specifically, two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation. Inside each booster, a max-feature-map embedded Transformer variant is proposed for modality alignment and enhanced feature generation. The network is co-learned both from scratch and with pretrained models. Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance…
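
The abstract mentions a max-feature-map (MFM) embedded Transformer but does not describe how the operation is wired in. The sketch below shows only the MFM operation as it is commonly defined: split a feature vector into two halves and keep the element-wise maximum, which halves the dimensionality. Plain Python lists stand in for tensors, and the function name is hypothetical; this is not the authors' implementation.

```python
from typing import List


def max_feature_map(features: List[float]) -> List[float]:
    """Element-wise max over the two halves of a feature vector (MFM)."""
    if len(features) % 2 != 0:
        raise ValueError("feature dimension must be even")
    half = len(features) // 2
    first, second = features[:half], features[half:]
    # Each output unit keeps the stronger of two competing activations.
    return [max(a, b) for a, b in zip(first, second)]


# Toy 6-dimensional input; the output has 3 dimensions.
print(max_feature_map([1.0, -2.0, 0.5, 3.0, -1.0, 0.0]))  # prints [3.0, -1.0, 0.5]
```

In a tensor library the same idea would typically be applied along the channel or feature axis of a batch, but where exactly it sits inside each booster is left to the paper and the linked repository.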

Code & Models

Repositories

danielmengliu/audiovisuallip (PyTorch, official)

Taxonomy

Methods: Attention Is All You Need · Linear Layer · Dropout · Byte Pair Encoding · Adam · Multi-Head Attention · Residual Connection · Layer Normalization · Softmax · Label Smoothing