Multi-talker ASR for an unknown number of sources: Joint training of source counting, separation and ASR

Thilo von Neumann; Christoph Boeddeker; Lukas Drude; Keisuke Kinoshita; Marc Delcroix; Tomohiro Nakatani; Reinhold Haeb-Umbach

arXiv:2006.02786·eess.AS·December 22, 2020

TL;DR

This paper introduces an end-to-end multi-talker ASR system that handles an unknown number of speakers by jointly performing source counting, separation, and recognition, and it sets a new state-of-the-art word error rate on WSJ0-2mix.

Contribution

It presents the first system that combines source counting, separation, and recognition for an unknown number of speakers in an end-to-end framework.

Findings

01

High counting accuracy on simulated clean mixtures (WSJ0-2mix and WSJ0-3mix)

02

State-of-the-art word error rate on WSJ0-2mix

03

Good generalization to more speakers than seen during training (WSJ0-4mix)

Abstract

Most approaches to multi-talker overlapped speech separation and recognition assume that the number of simultaneously active speakers is given, but in realistic situations, it is typically unknown. To cope with this, we extend an iterative speech extraction system with mechanisms to count the number of sources and combine it with a single-talker speech recognizer to form the first end-to-end multi-talker automatic speech recognition system for an unknown number of active speakers. Our experiments show very promising performance in counting accuracy, source separation and speech recognition on simulated clean mixtures from WSJ0-2mix and WSJ0-3mix. Among others, we set a new state-of-the-art word error rate on the WSJ0-2mix database. Furthermore, our system generalizes well to a larger number of speakers than it ever saw during training, as shown in experiments with the WSJ0-4mix database.
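The idea of counting sources while extracting them one at a time can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the paper's implementation: the function `count_and_extract`, the callback `extract_one`, the energy-based stop criterion, and the values `energy_threshold` and `max_sources` are all hypothetical, and the toy extractor stands in for a trained neural network.

```python
import numpy as np

def count_and_extract(mixture, extract_one, energy_threshold=1e-3, max_sources=8):
    """Peel off one source per iteration until the residual is nearly silent.

    The number of completed iterations serves as the estimated speaker count.
    """
    residual = np.asarray(mixture, dtype=float)
    sources = []
    while len(sources) < max_sources:
        # Assumed stop criterion: mean power of the residual below a threshold.
        if np.mean(residual ** 2) < energy_threshold:
            break
        estimate = extract_one(residual)   # hypothetical single-source extractor
        sources.append(estimate)
        residual = residual - estimate     # remove the extracted source
    return sources, residual

def toy_extract(residual, segment_len=4):
    """Toy stand-in for a trained extractor: keeps the first active segment."""
    out = np.zeros_like(residual)
    active = np.flatnonzero(np.abs(residual) > 1e-9)
    if active.size:
        start = (active[0] // segment_len) * segment_len
        out[start:start + segment_len] = residual[start:start + segment_len]
    return out

# Two toy "speakers" occupying disjoint segments of a 12-sample signal.
mixture = np.array([1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2], dtype=float)
sources, residual = count_and_extract(mixture, toy_extract)
estimated_count = len(sources)
```

In a full system, each extracted signal would then be passed to a single-talker recognizer; that step is omitted here because the abstract does not describe its interface.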
