SpeakerNet: 1D Depth-wise Separable Convolutional Network for Text-Independent Speaker Recognition and Verification

Nithin Rao Koluguri; Jason Li; Vitaly Lavrukhin; Boris Ginsburg

arXiv:2010.12653 · eess.AS · October 27, 2020 · 29 citations

Open Access

TL;DR

SpeakerNet is a lightweight neural network architecture that combines 1D depth-wise separable convolutions with x-vector style statistics pooling for text-independent speaker recognition and verification. It achieves near state-of-the-art accuracy without voice activity detection (VAD).

Contribution

The paper presents a residual network architecture built from 1D depth-wise separable convolutions and an x-vector based statistics pooling layer. The SpeakerNet-M variant reaches close to state-of-the-art accuracy with only 5M parameters and no VAD.
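The parameter savings that depth-wise separable convolutions offer can be illustrated with a little arithmetic. The sketch below counts weights for a standard 1D convolution and for its depth-wise separable counterpart (one filter per input channel, followed by a 1x1 pointwise convolution). The function names and the layer sizes are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in a standard 1D convolution (bias omitted)."""
    return c_in * c_out * kernel


def separable_conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in a depth-wise separable 1D convolution (bias omitted)."""
    depthwise = c_in * kernel   # one length-`kernel` filter per input channel
    pointwise = c_in * c_out    # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Illustrative sizes only (assumed, not from the paper).
c_in, c_out, kernel = 256, 256, 11
standard = conv1d_params(c_in, c_out, kernel)
separable = separable_conv1d_params(c_in, c_out, kernel)
print(standard, separable, round(standard / separable, 1))
# prints: 720896 68352 10.5
```

For these assumed sizes the separable layer needs roughly a tenth of the weights, which is the kind of saving that makes a small total parameter budget plausible.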

Findings

01. Achieves EER of 2.10% on VoxCeleb1 cleaned data.

02. Uses only 5 million parameters in the lightweight model.

03. Does not require voice activity detection (VAD).
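For readers unfamiliar with the metric in the first finding, the Equal Error Rate is the operating point at which the false acceptance rate (non-target trials accepted) equals the false rejection rate (target trials rejected). The following is a minimal sketch of one way to approximate it by sweeping a threshold over the observed scores. The function name `equal_error_rate` and the toy scores are hypothetical; this is not the paper's evaluation code, and the scoring back end that produces the trial scores is not specified here.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score >= threshold. Returns the mean of the
    false acceptance and false rejection rates at the threshold where the
    two rates are closest to each other.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = float("inf"), 1.0
    for t in thresholds:
        # Target (same-speaker) trials that fall below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # Non-target (different-speaker) trials that pass the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Made-up similarity scores for illustration only.
targets = [0.9, 0.8, 0.7, 0.3]
nontargets = [0.6, 0.2, 0.1, 0.05]
eer = equal_error_rate(targets, nontargets)
print(f"EER = {eer:.2%}")
# prints: EER = 25.00%
```

With real trial lists the sweep runs over many thousands of scores, so production toolkits usually sort once and interpolate between thresholds instead of recomputing both rates at every step.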

Abstract

We propose SpeakerNet - a new neural architecture for speaker recognition and speaker verification tasks. It is composed of residual blocks with 1D depth-wise separable convolutions, batch-normalization, and ReLU layers. This architecture uses an x-vector based statistics pooling layer to map variable-length utterances to a fixed-length embedding (q-vector). SpeakerNet-M is a simple lightweight model with just 5M parameters. It doesn't use voice activity detection (VAD) and achieves close to state-of-the-art performance, scoring an Equal Error Rate (EER) of 2.10% on the VoxCeleb1 cleaned trial files and 2.29% on the VoxCeleb1 trial files.
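The statistics pooling step mentioned in the abstract is what lets utterances of different lengths produce a vector of the same size. A minimal sketch of the idea in plain Python follows, assuming frame-level features arrive as a list of equal-length vectors. The function name `statistics_pooling` and the toy inputs are hypothetical, and the choice of the population standard deviation is an assumption; the paper's implementation may differ in these details.

```python
import math


def statistics_pooling(frames):
    """Pool a (T x D) list of frame-level vectors into a 2*D vector:
    the per-dimension mean followed by the per-dimension std over time."""
    if not frames:
        raise ValueError("need at least one frame")
    t = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / t for d in range(dim)]
    # Population standard deviation over time (an assumption; some
    # implementations use the unbiased estimate instead).
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / t)
        for d in range(dim)
    ]
    return means + stds


# Two toy "utterances" with different numbers of frames (made-up values).
short_utt = [[1.0, 2.0], [3.0, 4.0]]
long_utt = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0], [5.0, 0.0]]
print(len(statistics_pooling(short_utt)), len(statistics_pooling(long_utt)))
# prints: 4 4
```

Both inputs yield a vector of length 2*D regardless of the number of frames, so the layers that follow the pooling step can have a fixed input size.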

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing