TL;DR
This paper introduces a cross-modal co-learning framework that uses audio-visual correlations and Transformer-based modules to improve text-independent speaker verification, reporting average relative improvements of 60% over audio-only systems and 20% over baseline fusion systems.
Contribution
It proposes a novel cross-modal co-learning paradigm with Transformer-based modality alignment for speaker verification, utilizing synchronized audio-visual data.
Findings
Achieves a 60% average relative improvement over audio-only systems.
Achieves a 20% average relative improvement over baseline fusion systems.
Demonstrates effectiveness on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets.
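To make the notion of relative improvement concrete, here is a minimal sketch. It assumes the underlying metric is an error rate where lower is better (for example, equal error rate); this summary does not name the metric. The function name and the numbers are hypothetical and are not results from the paper.

```python
def relative_improvement(baseline_error: float, system_error: float) -> float:
    """Relative reduction of an error metric, in percent.

    Assumes a lower-is-better metric such as equal error rate (an assumption;
    the summary does not state which metric the paper reports).
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - system_error) / baseline_error


# Hypothetical error rates, chosen only to illustrate the arithmetic.
audio_only_error = 5.0
co_learned_error = 2.0
print(relative_improvement(audio_only_error, co_learned_error))  # 60.0
```

Under this reading, a 60% relative improvement means the error drops to 40% of the baseline value, not that it falls by 60 absolute percentage points.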
Abstract
Visual speech (i.e., lip motion) is highly related to auditory speech due to the co-occurrence and synchronization in speech production. This paper investigates this correlation and proposes a cross-modal speech co-learning paradigm. The primary motivation of our cross-modal co-learning method is modeling one modality aided by exploiting knowledge from another modality. Specifically, two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation. Inside each booster, a max-feature-map embedded Transformer variant is proposed for modality alignment and enhanced feature generation. The network is co-learned both from scratch and with pretrained models. Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance…
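For readers unfamiliar with the max-feature-map (MFM) operation mentioned in the abstract, the sketch below shows its commonly used form: the feature vector is split into two halves and the element-wise maximum is kept, which halves the dimensionality. How the paper embeds MFM inside its Transformer variant is not described in this summary, so this is only the generic operation, and the function name is hypothetical.

```python
from typing import List


def max_feature_map(features: List[float]) -> List[float]:
    """Generic max-feature-map: split the vector in two halves and keep
    the element-wise maximum, halving the feature dimension.

    This is an illustration of the basic operation only, not the paper's
    Transformer variant.
    """
    if len(features) % 2 != 0:
        raise ValueError("feature dimension must be even")
    half = len(features) // 2
    first, second = features[:half], features[half:]
    return [max(a, b) for a, b in zip(first, second)]


# Toy 6-dimensional feature vector reduced to 3 dimensions.
print(max_feature_map([0.2, -1.0, 3.0, 0.5, 0.1, -2.0]))  # [0.5, 0.1, 3.0]
```

Because each output is chosen between two competing inputs, MFM acts as a learned feature selector rather than a fixed threshold such as ReLU.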
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Dropout · Byte Pair Encoding · Adam · Multi-Head Attention · Residual Connection · Layer Normalization · Softmax · Label Smoothing
