CA-MHFA: A Context-Aware Multi-Head Factorized Attentive Pooling for SSL-Based Speaker Verification
Junyi Peng, Ladislav Mošner, Lin Zhang, Oldřich Plchot, Themos Stafylakis, Lukáš Burget, Jan Černocký

TL;DR
This paper introduces CA-MHFA, a lightweight, context-aware pooling method for SSL-based speaker verification that improves accuracy, efficiency, and generalization across tasks by modeling local temporal dependencies effectively.
Contribution
The paper proposes CA-MHFA, a novel context-aware multi-head attentive pooling framework that enhances SSL-based speaker verification by capturing local dependencies with fewer parameters.
Findings
Achieves state-of-the-art EERs on VoxCeleb benchmarks
Outperforms complex models like WavLM-TDNN with fewer parameters
Demonstrates strong generalization across multiple tasks
Abstract
Self-supervised learning (SSL) models for speaker verification (SV) have gained significant attention in recent years. However, existing SSL-based SV systems often struggle to capture local temporal dependencies and generalize across different tasks. In this paper, we propose context-aware multi-head factorized attentive pooling (CA-MHFA), a lightweight framework that incorporates contextual information from surrounding frames. CA-MHFA leverages grouped, learnable queries to effectively model contextual dependencies while maintaining efficiency by sharing keys and values across groups. Experimental results on the VoxCeleb dataset show that CA-MHFA achieves EERs of 0.42%, 0.48%, and 0.96% on Vox1-O, Vox1-E, and Vox1-H, respectively, outperforming complex models like WavLM-TDNN with fewer parameters and faster convergence. Additionally, CA-MHFA demonstrates strong generalization across…
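The pooling idea described in the abstract (grouped learnable queries that look at surrounding frames, with keys and values shared across groups) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `context_aware_pooling`, the tensor shapes, the zero-padded centered window, and the choice of one query per context offset are all assumptions made for illustration, and the random arrays merely stand in for frame-level keys and values that a real system would presumably derive from an SSL model.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def context_aware_pooling(keys, values, queries):
    """Hypothetical sketch of context-aware attentive pooling.

    keys:    (T, Dk) frame-level keys, shared by all groups
    values:  (T, Dv) frame-level values, shared by all groups
    queries: (G, C, Dk) G groups of C learnable queries; in this sketch
             (an assumption) each query in a group attends to one offset
             of a C-frame window centered on the current frame
    Returns the utterance-level vector (G * Dv,) and the weights (G, T).
    """
    T, _ = keys.shape
    G, C, _ = queries.shape
    left = (C - 1) // 2
    right = C - 1 - left
    # Zero-pad the time axis so every frame has a full context window.
    padded = np.pad(keys, ((left, right), (0, 0)))
    # windows[t, c] is the key at offset (c - left) from frame t.
    windows = np.stack([padded[c:c + T] for c in range(C)], axis=1)  # (T, C, Dk)
    # One score per group and frame, summed over the context window.
    logits = np.einsum("gcd,tcd->gt", queries, windows)
    weights = softmax(logits, axis=1)   # normalize over time, per group
    pooled = weights @ values           # (G, Dv), values shared across groups
    return pooled.reshape(-1), weights

# Toy usage with random stand-in features.
rng = np.random.default_rng(0)
T, Dk, Dv, G, C = 50, 16, 32, 4, 3
keys = rng.standard_normal((T, Dk))
values = rng.standard_normal((T, Dv))
queries = rng.standard_normal((G, C, Dk))
embedding, weights = context_aware_pooling(keys, values, queries)
print(embedding.shape)  # (128,)
```

In this sketch, setting the window size `C` to 1 reduces each group to ordinary single-query attentive pooling, which makes the added context the only difference between the two variants. Because the keys and values are computed once and reused by every group, adding groups only adds query parameters.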
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Softmax · Attention Is All You Need
