Streaming end-to-end multi-talker speech recognition

Liang Lu; Naoyuki Kanda; Jinyu Li; Yifan Gong

arXiv:2011.13148·cs.SD·May 12, 2021


TL;DR

This paper introduces SURT, a streaming end-to-end multi-talker speech recognition model that operates with low latency and achieves accuracy comparable to offline models on the LibriSpeechMix dataset.

Contribution

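The paper proposes SURT, an RNN-T-based model for streaming multi-talker speech recognition. It studies two architectures (a speaker-differentiator encoder and a mask encoder) and two training approaches (PIT and HEAT), addressing the lack of streaming solutions in prior end-to-end work.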

Findings

01. HEAT training achieves better accuracy than PIT.

02. With a 150 ms latency constraint, SURT achieves accuracy comparable to offline models.

03. SURT performs multi-talker recognition in streaming mode, where prior end-to-end work was offline only.

Abstract

End-to-end multi-talker speech recognition is an emerging research trend in the speech community due to its vast potential in applications such as conversation and meeting transcriptions. To the best of our knowledge, all existing research works are constrained in the offline scenario. In this work, we propose the Streaming Unmixing and Recognition Transducer (SURT) for end-to-end multi-talker speech recognition. Our model employs the Recurrent Neural Network Transducer (RNN-T) as the backbone that can meet various latency constraints. We study two different model architectures that are based on a speaker-differentiator encoder and a mask encoder respectively. To train this model, we investigate the widely used Permutation Invariant Training (PIT) approach and the Heuristic Error Assignment Training (HEAT) approach. Based on experiments on the publicly available LibriSpeechMix dataset,…
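The difference between the two training approaches named in the abstract can be illustrated with a toy sketch. PIT evaluates every assignment of output channels to reference transcripts and keeps the cheapest one, while HEAT fixes a single assignment by a heuristic. The sketch below assumes the heuristic is utterance start time, which the abstract above does not state, and uses a hypothetical `toy_loss` as a stand-in for the RNN-T loss; all function names are illustrative and not taken from the paper.

```python
from itertools import permutations


def toy_loss(hyp, ref):
    """Hypothetical stand-in for the RNN-T loss: mismatched tokens plus length gap."""
    mismatches = sum(h != r for h, r in zip(hyp, ref))
    return mismatches + abs(len(hyp) - len(ref))


def pit_loss(outputs, refs, loss_fn=toy_loss):
    """PIT: score every channel-to-reference permutation and keep the minimum."""
    return min(
        sum(loss_fn(out, refs[j]) for out, j in zip(outputs, perm))
        for perm in permutations(range(len(refs)))
    )


def heat_loss(outputs, refs, start_times, loss_fn=toy_loss):
    """HEAT sketch: one fixed assignment, here ordered by start time (an assumption)."""
    order = sorted(range(len(refs)), key=lambda i: start_times[i])
    return sum(loss_fn(out, refs[i]) for out, i in zip(outputs, order))


# Two output channels and two reference transcripts (toy data).
outputs = [["hello", "world"], ["good", "morning"]]
refs = [["good", "morning"], ["hello", "world"]]
start_times = [1.2, 0.0]  # refs[1] starts first, so it is assigned to channel 0

print(pit_loss(outputs, refs))
print(heat_loss(outputs, refs, start_times))
```

In this sketch PIT computes the loss once per permutation, whereas HEAT computes it once per channel, so the heuristic assignment needs fewer loss evaluations as the number of channels grows.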
