Multi-turn RNN-T for streaming recognition of multi-party speech
Ilya Sklyar, Anna Piunova, Xianrui Zheng, Yulan Liu

TL;DR
This paper introduces a multi-turn RNN-T (MT-RNN-T) model for real-time streaming recognition of multi-party speech. It reports relative WER improvements of 14% from overlap simulation and 28% over the two-speaker MS-RNN-T, and it generalizes to an arbitrary number of speakers.
Contribution
It proposes a novel multi-turn RNN-T architecture with overlap-based target arrangement and on-the-fly overlapping speech simulation for improved real-time multi-party ASR.
Findings
14% relative WER improvement on the LibriSpeechMix test set from on-the-fly overlapping speech simulation
28% relative WER improvement over two-speaker MS-RNN-T
Effective generalization to arbitrary number of speakers
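
The findings above are stated as relative WER improvements, that is, the WER reduction divided by the baseline WER. A minimal sketch of that arithmetic follows. The function name and the example numbers are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (in percent), not taken from the paper:
# a drop from 10.0 to 8.6 is a 1.4-point absolute reduction,
# which is a relative improvement of about 14%.
example = relative_wer_improvement(10.0, 8.6)
print(f"{example:.1%}")
```

A relative figure only says how much of the baseline error was removed, so it cannot be compared across systems without the absolute WERs.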
Abstract
Automatic speech recognition (ASR) of single-channel far-field recordings with an unknown number of speakers is traditionally tackled by cascaded modules. Recent research shows that end-to-end (E2E) multi-speaker ASR models can achieve superior recognition accuracy compared to modular systems. However, these models do not ensure real-time applicability because they depend on full audio context. This work takes real-time applicability as the first priority in model design and addresses a few challenges in previous work on the multi-speaker recurrent neural network transducer (MS-RNN-T). First, we introduce on-the-fly overlapping speech simulation during training, yielding a 14% relative word error rate (WER) improvement on the LibriSpeechMix test set. Second, we propose a novel multi-turn RNN-T (MT-RNN-T) model with an overlap-based target arrangement strategy that generalizes to an arbitrary…
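
The abstract does not describe how the on-the-fly overlapping speech simulation is implemented. As a rough sketch of the general idea only, the snippet below mixes two single-speaker waveforms at a randomly drawn start delay each time it is called, so a training loop would see a different overlap pattern every epoch. The function name, the `max_delay` parameter, the uniform delay distribution, and the plain sample-wise sum are all assumptions made for illustration, not the authors' method.

```python
import random


def simulate_overlap(utt_a, utt_b, max_delay, rng):
    """Mix two waveforms (lists of float samples) with a random start delay.

    Hypothetical helper: the second utterance starts `delay` samples after
    the first, with the delay capped so the two always overlap.
    """
    if not utt_a or not utt_b:
        raise ValueError("both utterances must be non-empty")
    # Cap the delay at the last sample of utt_a so at least one sample overlaps.
    delay = rng.randint(0, min(max_delay, len(utt_a) - 1))
    total = max(len(utt_a), delay + len(utt_b))
    mixed = [0.0] * total
    for i, sample in enumerate(utt_a):
        mixed[i] += sample
    for i, sample in enumerate(utt_b):
        mixed[delay + i] += sample
    return mixed, delay


# Toy waveforms standing in for real audio; a fresh delay is drawn on each call.
rng = random.Random(0)
mixed, delay = simulate_overlap([1.0] * 4, [0.5] * 3, max_delay=2, rng=rng)
print(delay, mixed)
```

Drawing the delay at training time, rather than pre-mixing a fixed dataset, is what makes the simulation "on-the-fly" in this sketch; a real pipeline would also need to handle gain, more than two speakers, and the matching transcript arrangement.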
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
