Audio-Visual Speaker Verification via Joint Cross-Attention
R. Gnana Praveen, Jahangir Alam

TL;DR
This paper introduces a cross-attention-based method for audio-visual speaker verification that uses both inter- and intra-modal relationships to improve accuracy over existing fusion techniques.
Contribution
The paper proposes a cross-modal joint attention mechanism that exploits both inter-modal and intra-modal information to improve speaker verification performance.
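To make the idea concrete, here is a minimal NumPy sketch of one way a joint cross-attention step could look. It is an illustration under stated assumptions, not the authors' exact formulation: the function name `joint_cross_attention`, the weight matrices `w_ja` and `w_jv`, the tanh-scaled correlation, the softmax weighting, and the residual connection are all choices made for this example. The point it shows is that each modality is correlated with a joint (concatenated) audio-visual representation, so the resulting attention weights reflect both intra-modal and inter-modal relationships.

```python
import numpy as np


def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def joint_cross_attention(x_a, x_v, w_ja, w_jv):
    """Hypothetical joint cross-attention step (illustrative only).

    x_a:  (T, d_a) audio features for T time steps
    x_v:  (T, d_v) visual features for the same T time steps
    w_ja: (d_a, d_a + d_v) assumed learnable audio-to-joint weights
    w_jv: (d_v, d_a + d_v) assumed learnable visual-to-joint weights
    """
    # Joint representation: audio and visual features side by side.
    j = np.concatenate([x_a, x_v], axis=1)            # (T, d_a + d_v)
    scale = np.sqrt(j.shape[1])

    # Correlation of each modality with the joint representation.
    c_a = np.tanh(x_a @ w_ja @ j.T / scale)           # (T, T)
    c_v = np.tanh(x_v @ w_jv @ j.T / scale)           # (T, T)

    # Attention weights over time steps; each row sums to 1.
    attn_a = softmax(c_a, axis=1)
    attn_v = softmax(c_v, axis=1)

    # Attended features with a residual connection (an assumption here).
    x_att_a = x_a + attn_a @ x_a                      # (T, d_a)
    x_att_v = x_v + attn_v @ x_v                      # (T, d_v)
    return x_att_a, x_att_v, attn_a, attn_v


# Toy usage with random features and random (untrained) weights.
rng = np.random.default_rng(0)
T, d_a, d_v = 6, 8, 4
x_a = rng.standard_normal((T, d_a))
x_v = rng.standard_normal((T, d_v))
w_ja = rng.standard_normal((d_a, d_a + d_v)) * 0.1
w_jv = rng.standard_normal((d_v, d_a + d_v)) * 0.1

x_att_a, x_att_v, attn_a, attn_v = joint_cross_attention(x_a, x_v, w_ja, w_jv)

# Pool over time and concatenate to form one audio-visual embedding.
embedding = np.concatenate([x_att_a.mean(axis=0), x_att_v.mean(axis=0)])
```

In a trained system the weight matrices would be learned jointly with the audio and visual backbones, and the pooled embedding would feed a verification scoring stage. The mean pooling used here is only a placeholder.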
Findings
Significantly outperforms state-of-the-art methods on the VoxCeleb1 dataset.
Effectively captures inter- and intra-modal relationships.
Improves accuracy of audio-visual speaker verification.
Abstract
Speaker verification has been widely explored using speech signals, which has shown significant improvement using deep models. Recently, there has been a surge in exploring faces and voices as they can offer more complementary and comprehensive information than relying only on a single modality of speech signals. Though current methods in the literature on the fusion of faces and voices have shown improvement over that of individual face or voice modalities, the potential of audio-visual fusion is not fully explored for speaker verification. Most of the existing methods based on audio-visual fusion either rely on score-level fusion or simple feature concatenation. In this work, we have explored cross-modal joint attention to fully leverage the inter-modal complementary information and the intra-modal information for speaker verification. Specifically, we estimate the cross-attention…
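The two baseline strategies the abstract mentions, score-level fusion and simple feature concatenation, can be sketched in a few lines of plain Python. This is a generic illustration rather than any specific published system: the function names, the cosine-similarity scoring, the equal default `weight`, and the toy embeddings are all assumptions made for the example.

```python
import math


def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def score_level_fusion(enrol_a, test_a, enrol_v, test_v, weight=0.5):
    # Score each modality separately, then combine the two scores.
    audio_score = cosine(enrol_a, test_a)
    visual_score = cosine(enrol_v, test_v)
    return weight * audio_score + (1.0 - weight) * visual_score


def concat_fusion(enrol_a, test_a, enrol_v, test_v):
    # Concatenate audio and visual embeddings, then score once.
    return cosine(enrol_a + enrol_v, test_a + test_v)


# Toy embeddings (hypothetical values): voices match, faces do not.
enrol_a, test_a = [1.0, 0.0], [1.0, 0.0]
enrol_v, test_v = [0.0, 2.0], [2.0, 0.0]

fused_score = score_level_fusion(enrol_a, test_a, enrol_v, test_v)
concat_score = concat_fusion(enrol_a, test_a, enrol_v, test_v)
```

Neither baseline models how the two modalities relate to each other: score-level fusion combines independent decisions, and concatenation leaves the score sensitive to the relative scale of the two embeddings. That gap is what attention-based fusion aims to address.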
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Image and Signal Denoising Methods
