RSKNet-MTSP: Effective and Portable Deep Architecture for Speaker Verification

Yanfeng Wu; Chenkai Guo; Junan Zhao; Xiao Jin; Jing Xu

arXiv:2108.13249 · cs.SD · August 31, 2021

TL;DR

This paper introduces RSKNet-MTSP, a CNN architecture for speaker verification that captures long-term and multi-scale speaker features, together with a lightweight variant, RSKNet-MTSP-L, aimed at resource-limited applications. The full model is reported to outperform state-of-the-art methods by 9-26% on public datasets.

Contribution

The paper proposes RSKNet-MTSP, built from residual selective kernel blocks (RSKBlock) and a multiple time-scale statistics pooling (MTSP) module, and a lightweight RSKNet-MTSP-L that uses depthwise separable convolutions and low-rank factorization to reduce the parameter count.
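To give a rough sense of why depthwise separable convolutions and low-rank factorization shrink a model, the sketch below counts weights for each layer type. The layer sizes (64 input channels, 128 output channels, 3x3 kernel, a 512x512 dense layer, rank 64) and all function names are illustrative assumptions, not values or code from the paper, and biases are ignored.

```python
def conv2d_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Depthwise k x k conv (one filter per input channel)
    followed by a 1 x 1 pointwise conv that mixes channels."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise


def dense_params(n_in: int, n_out: int) -> int:
    """Weights in a fully connected layer (bias omitted)."""
    return n_in * n_out


def low_rank_dense_params(n_in: int, n_out: int, rank: int) -> int:
    """Dense layer factorized into (n_in x rank) and (rank x n_out) matrices."""
    return rank * (n_in + n_out)


if __name__ == "__main__":
    # Hypothetical layer sizes, chosen only for illustration.
    print("standard conv:       ", conv2d_params(64, 128, 3))
    print("depthwise separable: ", depthwise_separable_params(64, 128, 3))
    print("dense:               ", dense_params(512, 512))
    print("low-rank dense (r=64):", low_rank_dense_params(512, 512, 64))
```

The savings for a whole network depend on how many layers are replaced and on the chosen rank, so these per-layer ratios should not be read as the 17-39% reduction reported for RSKNet-MTSP-L.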

Findings

01 RSKNet-MTSP outperforms state-of-the-art methods by 9-26% on public datasets.

02 RSKNet-MTSP-L achieves comparable results with 17-39% fewer parameters.

03 Extensive experiments validate the effectiveness of the proposed architecture.

Abstract

Convolutional neural network (CNN) based approaches have shown great success in speaker verification (SV) tasks, where modeling long temporal context and reducing the loss of speaker-characteristic information are two important challenges that significantly affect verification performance. Previous works have introduced dilated convolution and multi-scale aggregation methods to address these challenges. However, such methods still struggle to make full use of some valuable information, which makes it difficult to substantially improve verification performance. To address these issues, we construct a novel CNN-based architecture for SV, called RSKNet-MTSP, in which a residual selective kernel block (RSKBlock) and a multiple time-scale statistics pooling (MTSP) module are proposed for the first time. The RSKNet-MTSP can capture both long temporal context and neighbouring information, and gather…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing