RSKNet-MTSP: Effective and Portable Deep Architecture for Speaker Verification
Yanfeng Wu, Chenkai Guo, Junan Zhao, Xiao Jin, Jing Xu

TL;DR
This paper introduces RSKNet-MTSP, a CNN architecture for speaker verification that captures long-term and multi-scale speaker features, together with a lightweight variant, RSKNet-MTSP-L, for resource-limited applications. The full model is reported to outperform prior state-of-the-art methods by 9-26% on public datasets.
Contribution
The paper proposes RSKNet-MTSP, built from residual selective kernel blocks (RSKBlock) and a multiple time-scale statistics pooling (MTSP) module, and a lightweight RSKNet-MTSP-L that uses depthwise separable convolutions and low-rank factorization to reduce the parameter count.
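To make the two parameter-saving techniques named above concrete, the sketch below counts weights for a standard convolution versus a depthwise separable one, and for a dense layer versus a low-rank factorized one. The layer sizes, rank, and function names are hypothetical and chosen only for illustration; they are not taken from the paper, and the per-layer savings shown are not the paper's network-level figures.

```python
def conv2d_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """One k x k depthwise filter per input channel, then a 1 x 1 pointwise mix."""
    return c_in * k * k + c_in * c_out


def low_rank_linear_params(d_in, d_out, rank):
    """A d_in x d_out weight matrix factorized as (d_in x rank) times (rank x d_out)."""
    return d_in * rank + rank * d_out


# Hypothetical layer sizes, for illustration only.
standard = conv2d_params(64, 64, 3)                # 64 * 64 * 9
separable = depthwise_separable_params(64, 64, 3)  # 64 * 9 + 64 * 64
dense = 512 * 256                                  # unfactorized linear layer
factored = low_rank_linear_params(512, 256, 64)    # 512 * 64 + 64 * 256

print(standard, separable)  # 36864 4672
print(dense, factored)      # 131072 49152
```

Both substitutions trade some representational capacity for fewer weights, so the rank and the choice of which layers to replace are typically tuned against accuracy.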
Findings
RSKNet-MTSP outperforms state-of-the-art methods by 9-26% on public datasets.
RSKNet-MTSP-L achieves comparable results with 17-39% fewer parameters.
Extensive experiments validate the effectiveness of the proposed architecture.
Abstract
Convolutional neural network (CNN) based approaches have shown great success in speaker verification (SV) tasks, where modeling long temporal context and reducing the loss of speaker-characteristic information are two important challenges that significantly affect verification performance. Previous works have introduced dilated convolution and multi-scale aggregation methods to address these challenges. However, such methods still struggle to make full use of some valuable information, which makes it difficult to substantially improve verification performance. To address these issues, we construct a novel CNN-based architecture for SV, called RSKNet-MTSP, in which a residual selective kernel block (RSKBlock) and a multiple time-scale statistics pooling (MTSP) module are proposed for the first time. RSKNet-MTSP can capture both long temporal context and neighbouring information, and gather…
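As background for the pooling module mentioned in the abstract, the following is a minimal sketch of plain statistics pooling as it is commonly used in speaker verification: frame-level features are summarised by their per-dimension mean and standard deviation, giving a fixed-length vector for any utterance length. It does not implement the authors' multiple time-scale variant, which this excerpt does not describe; the function name and toy values are assumptions.

```python
import math


def statistics_pooling(frames):
    """Pool a list of frame-level feature vectors into [means..., stds...]."""
    n = len(frames)
    dim = len(frames[0])
    # Per-dimension mean over the time axis.
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Per-dimension population standard deviation over the time axis.
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return means + stds


# Toy utterance: 4 frames, 2 feature dimensions (made-up values).
frames = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0], [7.0, 2.0]]
embedding = statistics_pooling(frames)
print(len(embedding))  # 4
```

The output length depends only on the feature dimension, not on the number of frames, which is what lets a fixed-size classifier head follow a variable-length input.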
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
