MASV: Speaker Verification with Global and Local Context Mamba

Yang Liu; Li Wan; Yiteng Huang; Ming Sun; Yangyang Shi; Florian Metze

arXiv:2412.10989 · eess.AS · December 17, 2024

TL;DR

The paper introduces MASV, a speaker verification model that combines local and global context modeling in an efficient architecture and reports higher accuracy and lower computational cost than existing methods.

Contribution

MASV integrates Mamba modules into the ECAPA-TDNN framework so that it captures both global and local context, improving speaker verification performance.
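
For context, speaker verification systems built on ECAPA-TDNN typically map each utterance to a fixed-size embedding and accept a trial when the cosine similarity between the enrollment and test embeddings exceeds a threshold. Accuracy is then commonly summarized as the equal error rate (EER). The sketch below assumes MASV embeddings are scored the same way; this page does not state the authors' scoring backend, and the threshold and toy vectors are hypothetical. The EER routine is a coarse threshold sweep, not the interpolated computation used by evaluation toolkits.

```python
import math
from typing import Sequence


def cosine_score(emb_a: Sequence[float], emb_b: Sequence[float]) -> float:
    """Cosine similarity between two fixed-size speaker embeddings."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_a * norm_b)


def same_speaker(enroll: Sequence[float], test: Sequence[float],
                 threshold: float = 0.5) -> bool:
    # Hypothetical default threshold; in practice it is tuned on held-out trials.
    return cosine_score(enroll, test) >= threshold


def equal_error_rate(target_scores: Sequence[float],
                     impostor_scores: Sequence[float]) -> float:
    """Coarse EER: sweep observed scores as thresholds, pick where FAR ~= FRR."""
    best_gap, eer = float("inf"), 1.0
    for thr in sorted(set(target_scores) | set(impostor_scores)):
        far = sum(s >= thr for s in impostor_scores) / len(impostor_scores)  # false accepts
        frr = sum(s < thr for s in target_scores) / len(target_scores)       # false rejects
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


print(same_speaker([0.9, 0.1, 0.3], [0.8, 0.2, 0.25]))      # True
print(equal_error_rate([0.9, 0.8, 0.7], [0.1, 0.2, 0.75]))  # about 0.333
```

A lower EER means the target and impostor score distributions overlap less, which is what "better verification accuracy" usually refers to in this field.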

Findings

1. Surpasses existing models in verification accuracy
2. Is more computationally efficient than transformer-based approaches
3. Models long audio sequences effectively with Mamba modules
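
The efficiency claim rests on how cost grows with sequence length: self-attention compares every frame with every other frame, while a Mamba-style scan does a fixed amount of work per frame. The back-of-the-envelope count below is an illustration under stated assumptions, not a measurement from the paper. It counts only the sequence-mixing multiply-accumulates (ignoring projections, softmax, and convolutions), and the model width `d_model=512`, state size `d_state=16`, and 10 ms frame hop are hypothetical values rather than MASV's configuration.

```python
def attention_macs(seq_len: int, d_model: int) -> int:
    # Q @ K^T builds an L x L score matrix (L*L*d MACs); scores @ V costs the same again.
    return 2 * seq_len * seq_len * d_model


def ssm_scan_macs(seq_len: int, d_model: int, d_state: int) -> int:
    # Per frame and channel: h = a*h + b*x (2 MACs per state entry), y = C.h (1 MAC per entry).
    return 3 * seq_len * d_model * d_state


D_MODEL, D_STATE = 512, 16  # hypothetical sizes, not taken from the paper

for seq_len in (300, 3000):  # roughly 3 s and 30 s of audio at an assumed 10 ms hop
    ratio = attention_macs(seq_len, D_MODEL) / ssm_scan_macs(seq_len, D_MODEL, D_STATE)
    print(f"{seq_len} frames: attention/scan cost ratio = {ratio:.1f}")
# 300 frames: attention/scan cost ratio = 12.5
# 3000 frames: attention/scan cost ratio = 125.0
```

Under these assumptions the ratio simplifies to 2L/(3N), so the gap widens linearly with utterance length, which is why long audio sequences are where a linear-time scan matters most.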

Abstract

Deep learning models like Convolutional Neural Networks and transformers have shown impressive capabilities in speaker verification, gaining considerable attention in the research community. However, CNN-based approaches struggle with modeling long-sequence audio effectively, resulting in suboptimal verification performance. On the other hand, transformer-based methods are often hindered by high computational demands, limiting their practicality. This paper presents the MASV model, a novel architecture that integrates the Mamba module into the ECAPA-TDNN framework. By introducing the Local Context Bidirectional Mamba and Tri-Mamba block, the model effectively captures both global and local context within audio sequences. Experimental results demonstrate that the MASV model substantially enhances verification performance, surpassing existing models in both accuracy and efficiency.
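
The abstract names a Local Context Bidirectional Mamba and a Tri-Mamba block but does not define them. The pure-Python sketch below shows the underlying building block, a simplified single-channel selective state-space scan in the spirit of Mamba (input-dependent step size, B_t, and C_t; zero-order hold for A and an Euler step for B), plus one plausible reading of global versus local bidirectional context: scanning the whole sequence in both directions, or scanning only within fixed-size windows. The windowing scheme, the sum used to merge directions, and all parameter values are assumptions for illustration. This is not the authors' implementation, and it does not reproduce the Tri-Mamba block.

```python
import math
from typing import List, Sequence, Tuple

# (a_diag, w_delta, w_b, w_c); all values used below are hypothetical toy parameters.
Params = Tuple[Sequence[float], float, Sequence[float], Sequence[float]]


def softplus(v: float) -> float:
    # Numerically stable log(1 + exp(v)); keeps the step size positive.
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def selective_scan(xs: Sequence[float], a_diag: Sequence[float], w_delta: float,
                   w_b: Sequence[float], w_c: Sequence[float]) -> List[float]:
    """Causal, single-channel selective SSM scan (simplified Mamba-style sketch).

    a_diag holds the negative diagonal of A (state size N = len(a_diag)).
    Toy simplification: delta_t, B_t and C_t are scalar multiples of x_t,
    whereas Mamba computes them from linear projections of a multi-channel input.
    """
    h = [0.0] * len(a_diag)
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)               # input-dependent step size
        b = [wb * x for wb in w_b]                  # input-dependent B_t
        c = [wc * x for wc in w_c]                  # input-dependent C_t
        for i, a in enumerate(a_diag):
            a_bar = math.exp(delta * a)             # zero-order hold for A
            h[i] = a_bar * h[i] + delta * b[i] * x  # Euler step for B
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys


def bidirectional_scan(xs: List[float], fwd: Params, bwd: Params) -> List[float]:
    # Forward scan plus a scan over the reversed sequence, re-reversed and summed.
    forward = selective_scan(xs, *fwd)
    backward = selective_scan(xs[::-1], *bwd)[::-1]
    return [f + b for f, b in zip(forward, backward)]


def local_bidirectional_scan(xs: List[float], window: int,
                             fwd: Params, bwd: Params) -> List[float]:
    # Hypothetical local variant: bidirectional scans restricted to fixed windows,
    # so no information crosses a window boundary.
    out: List[float] = []
    for start in range(0, len(xs), window):
        out.extend(bidirectional_scan(xs[start:start + window], fwd, bwd))
    return out


fwd = ([-1.0, -2.0], 1.0, [0.5, 1.0], [1.0, 0.5])
bwd = ([-0.5, -1.5], 0.8, [1.0, 0.3], [0.4, 1.0])
xs = [0.5, -0.3, 0.8, 0.1, -0.6, 0.4]
xs_edit = xs[:4] + [0.9] + xs[5:]  # change only frame 4

glob, glob_edit = bidirectional_scan(xs, fwd, bwd), bidirectional_scan(xs_edit, fwd, bwd)
loc = local_bidirectional_scan(xs, 3, fwd, bwd)
loc_edit = local_bidirectional_scan(xs_edit, 3, fwd, bwd)
print(glob[0] != glob_edit[0])  # True: with global context, frame 0 sees the edit
print(loc[:3] == loc_edit[:3])  # True: with local context, the first window is unaffected
```

Changing one frame alters outputs across the whole sequence in the global bidirectional scan, but only outputs inside the same window in the local variant; that is the distinction between global and local context the abstract refers to, shown here under the assumptions above.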

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Softmax · Attention Is All You Need · Mamba: Linear-Time Sequence Modeling with Selective State Spaces