Multi-turn RNN-T for streaming recognition of multi-party speech

Ilya Sklyar; Anna Piunova; Xianrui Zheng; Yulan Liu

arXiv:2112.10200 · eess.AS · February 11, 2022

Open Access

TL;DR

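This paper introduces a multi-turn RNN-T (MT-RNN-T) model for real-time streaming recognition of multi-party speech. It reports relative WER improvements of 14% from on-the-fly overlap simulation and 28% over a two-speaker MS-RNN-T, and it generalizes to an arbitrary number of speakers.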

Contribution

It proposes a novel multi-turn RNN-T architecture with overlap-based target arrangement and on-the-fly overlapping speech simulation for improved real-time multi-party ASR.

Findings

01. 14% relative WER improvement on the LibriSpeechMix test set from on-the-fly overlapping speech simulation

02. 28% relative WER improvement over the two-speaker MS-RNN-T

03. Effective generalization to an arbitrary number of speakers
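
The findings above are stated as relative, not absolute, WER improvements. As a reminder of the arithmetic, the following is a minimal sketch in Python. The function name `relative_wer_improvement` and the WER values passed to it are hypothetical and chosen only to illustrate the formula; they are not results from the paper.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (in percent), used only to show the calculation.
improvement = relative_wer_improvement(baseline_wer=10.0, new_wer=8.6)
print(f"{improvement:.0%}")  # prints 14%
```

With these illustrative numbers, an absolute reduction of 1.4 WER points corresponds to a 14% relative improvement, because the reduction is divided by the baseline WER.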

Abstract

Automatic speech recognition (ASR) of single-channel far-field recordings with an unknown number of speakers is traditionally tackled by cascaded modules. Recent research shows that end-to-end (E2E) multi-speaker ASR models can achieve superior recognition accuracy compared to modular systems. However, these models do not ensure real-time applicability due to their dependency on full audio context. This work takes real-time applicability as the first priority in model design and addresses a few challenges in previous work on the multi-speaker recurrent neural network transducer (MS-RNN-T). First, we introduce on-the-fly overlapping speech simulation during training, yielding a 14% relative word error rate (WER) improvement on the LibriSpeechMix test set. Second, we propose a novel multi-turn RNN-T (MT-RNN-T) model with an overlap-based target arrangement strategy that generalizes to an arbitrary…
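
The abstract does not describe how the on-the-fly overlapping speech simulation is implemented. One common way to build such training mixtures is to overlay a second utterance onto the first at a random delay, so that each training step sees a freshly generated overlap. The sketch below illustrates that general idea only and is not the authors' procedure: the function `mix_with_overlap`, its parameters, and the uniform choice of delay are all assumptions, and waveforms are represented as plain lists of float samples.

```python
import random
from typing import List, Optional, Tuple


def mix_with_overlap(
    first: List[float],
    second: List[float],
    rng: Optional[random.Random] = None,
) -> Tuple[List[float], int]:
    """Overlay `second` onto `first` at a random sample offset.

    Assumption: the delay is drawn uniformly so that the second
    utterance starts before the first one ends, which guarantees
    at least one overlapping sample when both are non-empty.
    """
    rng = rng or random.Random()
    delay = rng.randint(0, max(0, len(first) - 1))

    # The mixture is long enough to hold both utterances.
    total_len = max(len(first), delay + len(second))
    mixture = [0.0] * total_len

    # Sum the two signals sample by sample.
    for i, sample in enumerate(first):
        mixture[i] += sample
    for i, sample in enumerate(second):
        mixture[delay + i] += sample
    return mixture, delay


# Toy signals standing in for two single-speaker utterances.
mixture, delay = mix_with_overlap([1.0] * 4, [0.5] * 3, random.Random(0))
print(len(mixture), delay)
```

Calling a function like this inside the data loader, rather than precomputing mixtures, means the delay differs each time a pair of utterances is drawn. The returned `delay` could also be used to order or align the reference transcripts, although how the paper arranges its targets is not specified in this excerpt.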

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing