Optimizing Multi-Taper Features for Deep Speaker Verification

Xuechen Liu; Md Sahidullah; Tomi Kinnunen

arXiv:2110.10983·cs.SD·October 27, 2021

TL;DR

This paper optimizes a multi-taper spectral estimator jointly with a deep speaker verification network, reducing equal error rate (EER) by up to 25.8% on the SITW corpus relative to a static-taper design.

Contribution

It proposes optimizing the multi-taper estimator jointly with the deep neural network trained for automatic speaker verification (ASV), instead of fixing the tapers in advance.

Findings

01 Up to 25.8% EER reduction on the SITW corpus

02 Improved robustness and a balanced leakage-variance trade-off

03 Demonstrated effectiveness of joint optimization in deep ASV

Abstract

Multi-taper estimators provide low-variance power spectrum estimates that can be used in place of the windowed discrete Fourier transform (DFT) to extract speech features such as mel-frequency cepstral coefficients (MFCCs). Although past work has reported promising automatic speaker verification (ASV) results with Gaussian mixture model-based classifiers, the performance of multi-taper MFCCs with deep ASV systems remains an open question. Instead of a static-taper design, we propose to optimize the multi-taper estimator jointly with a deep neural network trained for ASV tasks. With a maximum improvement of 25.8% in equal error rate on the SITW corpus over the static-taper baseline, our method helps preserve a balanced level of leakage and variance, providing more robustness.
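To make the idea of a multi-taper estimate concrete, the following is a minimal sketch of a static multi-taper power spectrum, using only the Python standard library. It is not the authors' implementation. The sine taper family, the naive DFT, the frame length, the number of tapers, and all function names are assumptions chosen for illustration; the abstract does not say which tapers or weights the paper uses. The `weights` argument is a hypothetical hook marking where a jointly optimized system could substitute learned values for the fixed uniform ones.

```python
import cmath
import math


def sine_tapers(frame_len, num_tapers):
    """Return num_tapers orthonormal sine tapers, each of length frame_len.

    Assumption: sine tapers are used only as a simple, closed-form example
    of a multi-taper family.
    """
    scale = math.sqrt(2.0 / (frame_len + 1))
    return [
        [scale * math.sin(math.pi * k * (n + 1) / (frame_len + 1))
         for n in range(frame_len)]
        for k in range(1, num_tapers + 1)
    ]


def power_spectrum(frame, taper):
    """Naive one-sided DFT power spectrum of a frame multiplied by one taper."""
    n_samples = len(frame)
    tapered = [x * w for x, w in zip(frame, taper)]
    spectrum = []
    for b in range(n_samples // 2 + 1):
        acc = sum(v * cmath.exp(-2j * math.pi * b * n / n_samples)
                  for n, v in enumerate(tapered))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def multitaper_spectrum(frame, tapers, weights=None):
    """Weighted sum of per-taper power spectra (uniform weights by default).

    Hypothetical hook: a jointly trained system could pass learned weights.
    """
    if weights is None:
        weights = [1.0 / len(tapers)] * len(tapers)
    spectra = [power_spectrum(frame, t) for t in tapers]
    return [sum(w * s[b] for w, s in zip(weights, spectra))
            for b in range(len(spectra[0]))]


# Toy input: one frame holding a sinusoid centred on DFT bin 8.
frame_len = 64
tapers = sine_tapers(frame_len, num_tapers=4)
frame = [math.sin(2 * math.pi * 8 * n / frame_len) for n in range(frame_len)]
estimate = multitaper_spectrum(frame, tapers)
peak_bin = max(range(len(estimate)), key=estimate.__getitem__)
```

Averaging several orthogonal tapers lowers the variance of the estimate but spreads the spectral peak over neighbouring bins, which is the leakage-variance trade-off the abstract refers to. In an MFCC pipeline, `estimate` would take the place of the single-window power spectrum before mel filtering.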
