Deep Speaker: an End-to-End Neural Speaker Embedding System

Chao Li; Xiaokong Ma; Bing Jiang; Xiangang Li; Xuewei Zhang; Xiao Liu; Ying Cao; Ajay Kannan; Zhenyao Zhu

arXiv:1705.02304 · cs.CL · May 8, 2017 · 427 cites

TL;DR

Deep Speaker is an end-to-end neural speaker embedding system that uses ResCNN and GRU architectures trained with a cosine-similarity triplet loss. It outperforms a DNN-based i-vector baseline on speaker verification and identification.

Contribution

The paper presents a neural speaker embedding system trained end to end that outperforms a DNN-based i-vector baseline on speaker recognition tasks.

Findings

01

Reduces verification equal error rate (EER) by 50% (relative) against a DNN-based i-vector baseline on a text-independent dataset

02

Improves identification accuracy by 60% (relative) against the same baseline on a text-independent dataset

03

Results suggest that adapting from a model trained on Mandarin can improve accuracy for English speaker recognition

Abstract

We present Deep Speaker, a neural speaker embedding system that maps utterances to a hypersphere where speaker similarity is measured by cosine similarity. The embeddings generated by Deep Speaker can be used for many tasks, including speaker identification, verification, and clustering. We experiment with ResCNN and GRU architectures to extract the acoustic features, then mean pool to produce utterance-level speaker embeddings, and train using triplet loss based on cosine similarity. Experiments on three distinct datasets suggest that Deep Speaker outperforms a DNN-based i-vector baseline. For example, Deep Speaker reduces the verification equal error rate by 50% (relatively) and improves the identification accuracy by 60% (relatively) on a text-independent dataset. We also present results that suggest adapting from a model trained with Mandarin can improve accuracy for English speaker…
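The pooling and loss described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the hinge form of the loss, and the margin value of 0.1 are assumptions made here, since the abstract only states that frame-level features are mean pooled, that embeddings lie on a hypersphere, and that the triplet loss is based on cosine similarity.

```python
import numpy as np


def utterance_embedding(frame_features):
    """Mean-pool frame-level features of shape (T, D) into one vector,
    then scale it to unit length so it lies on the hypersphere."""
    pooled = frame_features.mean(axis=0)
    return pooled / np.linalg.norm(pooled)


def cosine_triplet_loss(anchor, positive, negative, margin=0.1):
    """Hinge-style triplet loss on cosine similarity (assumed form).

    All inputs are unit-length embeddings, so the dot product equals the
    cosine similarity. The loss is zero once the anchor is closer to the
    same-speaker sample than to the different-speaker sample by at least
    `margin` (a hypothetical value chosen for illustration).
    """
    sim_ap = float(np.dot(anchor, positive))  # anchor vs. same speaker
    sim_an = float(np.dot(anchor, negative))  # anchor vs. other speaker
    return max(0.0, sim_an - sim_ap + margin)
```

In a real system the frame-level features would come from the ResCNN or GRU front end and the loss would be averaged over many triplets in a batch; here random arrays of shape (T, D) are enough to exercise the functions.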

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Gated Recurrent Unit