Identify Speakers in Cocktail Parties with End-to-End Attention

Junzhe Zhu; Mark Hasegawa-Johnson; Leda Sari

arXiv:2005.11408 · eess.AS · August 11, 2020

TL;DR

This paper introduces an end-to-end attention-based system that identifies speakers in overlapping speech by jointly optimizing source extraction and speaker identification. It recognizes both speakers in two-speaker mixtures with 93.9% accuracy and all speakers in three-speaker mixtures with 81.2% accuracy.

Contribution

The paper proposes an end-to-end model that combines residual attention with dilated convolution to perform speaker extraction and identification jointly, improving identification accuracy in multi-speaker recordings.
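To illustrate why dilated convolution helps cover a large context window, the following is a minimal sketch that computes the receptive field of a stack of stride-1 1-D convolutions. The kernel size and dilation schedule are hypothetical values chosen for illustration and are not taken from the paper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in frames) of stacked stride-1 1-D convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical schedule: dilation doubles at every layer (assumption).
dilated = receptive_field(3, [1, 2, 4, 8, 16, 32])

# Same depth without dilation, for comparison.
plain = receptive_field(3, [1, 1, 1, 1, 1, 1])

print(dilated)  # 127
print(plain)    # 13
```

With the same number of layers and parameters, the dilated stack in this sketch sees roughly ten times more frames than the undilated one.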

Findings

01. Recognizes one speaker in a two-speaker broadcast speech mixture with 99.9% accuracy.

02. Recognizes both speakers in a two-speaker mixture with 93.9% accuracy.

03. Recognizes all speakers in three-speaker scenarios with 81.2% accuracy.

Abstract

In scenarios where multiple speakers talk at the same time, it is important to be able to identify the talkers accurately. This paper presents an end-to-end system that integrates speech source extraction and speaker identification, and proposes a new way to jointly optimize these two parts by max-pooling the speaker predictions along the channel dimension. Residual attention permits us to learn spectrogram masks that are optimized for the purpose of speaker identification, while residual forward connections permit dilated convolution with a sufficiently large context window to guarantee correct streaming across syllable boundaries. End-to-end training results in a system that recognizes one speaker in a two-speaker broadcast speech mixture with 99.9% accuracy and both speakers with 93.9% accuracy, and that recognizes all speakers in three-speaker scenarios with 81.2% accuracy.
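The idea of max-pooling speaker predictions along the channel dimension can be sketched in a few lines. This is only an illustration under stated assumptions: each extracted channel is assumed to produce one score per enrolled speaker, and the function names, score values, and the 0.5 decision threshold are hypothetical rather than taken from the paper or its repository.

```python
def pool_speaker_scores(channel_scores):
    """Max-pool per-channel speaker scores along the channel dimension.

    channel_scores: list of C lists, each holding S per-speaker scores.
    Returns a list of S scores, one per speaker.
    """
    num_speakers = len(channel_scores[0])
    return [max(ch[s] for ch in channel_scores) for s in range(num_speakers)]


def predict_speakers(channel_scores, threshold=0.5):
    """Return indices of speakers whose pooled score reaches the threshold.

    The threshold is an assumption made for this sketch.
    """
    pooled = pool_speaker_scores(channel_scores)
    return [s for s, score in enumerate(pooled) if score >= threshold]


# Hypothetical scores: 2 extracted channels, 4 enrolled speakers.
scores = [
    [0.05, 0.90, 0.10, 0.02],  # channel 0 mostly captures speaker 1
    [0.03, 0.08, 0.07, 0.85],  # channel 1 mostly captures speaker 3
]

print(pool_speaker_scores(scores))  # [0.05, 0.9, 0.1, 0.85]
print(predict_speakers(scores))     # [1, 3]
```

Because the pooled scores do not depend on which channel captured which speaker, a loss computed on them can be compared directly with a multi-speaker label without fixing a channel-to-speaker assignment.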

Code & Models

Repositories

JunzheJosephZhu/Identify-Speakers-in-Cocktail-Parties-with-E2E-Attention
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing

Methods: Convolution · Dilated Convolution