CA-MHFA: A Context-Aware Multi-Head Factorized Attentive Pooling for SSL-Based Speaker Verification

Junyi Peng; Ladislav Mošner; Lin Zhang; Oldřich Plchot; Themos Stafylakis; Lukáš Burget; Jan Černocký

arXiv:2409.15234·eess.AS·September 24, 2024

TL;DR

This paper introduces CA-MHFA, a lightweight, context-aware pooling method for SSL-based speaker verification that improves accuracy, efficiency, and generalization across tasks by modeling local temporal dependencies effectively.

Contribution

The paper proposes CA-MHFA, a context-aware multi-head factorized attentive pooling framework that improves SSL-based speaker verification by capturing local temporal dependencies while using fewer parameters than more complex systems such as WavLM-TDNN.

Findings

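01. Achieves state-of-the-art EERs on the VoxCeleb benchmarks: 0.42%, 0.48%, and 0.96% on Vox1-O, Vox1-E, and Vox1-H, respectively

02. Outperforms more complex models such as WavLM-TDNN with fewer parameters and faster convergence

03. Demonstrates strong generalization across multiple tasks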
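The findings are reported as equal error rates (EER), the operating point at which the false-acceptance rate equals the false-rejection rate. A minimal sketch of how such a figure can be computed from trial scores follows. The function name `equal_error_rate` and the simple threshold sweep are illustrative assumptions, not the scoring tool used in the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    target_scores: scores of same-speaker trials (higher means more similar).
    nontarget_scores: scores of different-speaker trials.
    Hypothetical helper for illustration only.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for th in thresholds:
        # False rejection: a target trial scored below the threshold.
        frr = sum(s < th for s in target_scores) / len(target_scores)
        # False acceptance: a non-target trial scored at or above the threshold.
        far = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

An EER of 0.42% on Vox1-O therefore means that, at the balanced threshold, roughly 0.42% of same-speaker trials are rejected and roughly 0.42% of different-speaker trials are accepted.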

Abstract

Self-supervised learning (SSL) models for speaker verification (SV) have gained significant attention in recent years. However, existing SSL-based SV systems often struggle to capture local temporal dependencies and generalize across different tasks. In this paper, we propose context-aware multi-head factorized attentive pooling (CA-MHFA), a lightweight framework that incorporates contextual information from surrounding frames. CA-MHFA leverages grouped, learnable queries to effectively model contextual dependencies while maintaining efficiency by sharing keys and values across groups. Experimental results on the VoxCeleb dataset show that CA-MHFA achieves EERs of 0.42%, 0.48%, and 0.96% on Vox1-O, Vox1-E, and Vox1-H, respectively, outperforming complex models like WavLM-TDNN with fewer parameters and faster convergence. Additionally, CA-MHFA demonstrates strong generalization across…
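The mechanism outlined in the abstract (learnable queries that attend over surrounding frames, with one set of keys and values shared by all queries) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function name `context_aware_pool`, the zero-padding at utterance edges, and the representation of each query as a short list of context taps are assumptions made for illustration; the grouping of queries and the SSL front-end that produces keys and values are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_aware_pool(keys, values, queries):
    """Pool T frames into one vector per query head, then concatenate.

    keys:    T x D list of frame-level keys (shared by all heads).
    values:  T x Dv list of frame-level values (shared by all heads).
    queries: list of heads; each head holds C context taps (C odd),
             each tap a D-dimensional learnable vector (assumed layout).
    """
    num_frames = len(keys)
    pooled = []
    for head in queries:
        half = len(head) // 2
        scores = []
        for t in range(num_frames):
            score = 0.0
            # Sum dot products between each tap and the neighbouring key,
            # so the score at frame t depends on frames t-half .. t+half.
            for c, tap in enumerate(head):
                idx = t + c - half
                if 0 <= idx < num_frames:  # frames outside the utterance count as zero
                    score += sum(k * w for k, w in zip(keys[idx], tap))
            scores.append(score)
        weights = softmax(scores)  # attention over time for this head
        for d in range(len(values[0])):
            pooled.append(sum(w * v[d] for w, v in zip(weights, values)))
    return pooled
```

With a single tap per head this reduces to ordinary multi-head attentive pooling with learnable queries; widening the tap list is what lets the attention score at one frame depend on its neighbours. In a trained system the taps would be learned parameters and the concatenated output would typically pass through a further projection to form the speaker embedding.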

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Softmax · Attention Is All You Need