Streaming Target-Speaker ASR with Neural Transducer
Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takahiro Shinozaki

TL;DR
This paper introduces a streaming target-speaker ASR system that integrates target speech extraction into the encoder of an end-to-end neural transducer. It reaches accuracy comparable to cascade systems in offline mode while avoiding the computation cost of a separate front-end, which makes real-time processing feasible.
Contribution
It presents a novel streaming Conformer-based neural transducer approach for target-speaker ASR that eliminates the need for separate speech separation modules.
Findings
Achieves comparable accuracy to cascade systems in offline mode.
Reduces computation cost for streaming target-speaker recognition relative to a cascade with a separate extraction front-end.
Enables real-time, streaming ASR for multi-talker speech.
Abstract
Although recent advances in deep learning technology have boosted automatic speech recognition (ASR) performance in the single-talker case, it remains difficult to recognize multi-talker speech in which many voices overlap. One conventional approach to tackle this problem is to use a cascade of a speech separation or target speech extraction front-end with an ASR back-end. However, the extra computation costs of the front-end module are a critical barrier to quick response, especially for streaming ASR. In this paper, we propose a target-speaker ASR (TS-ASR) system that implicitly integrates the target speech extraction functionality within a streaming end-to-end (E2E) ASR system, i.e. recurrent neural network-transducer (RNNT). Our system uses a similar idea as adopted for target speech extraction, but implements it directly at the level of the encoder of RNNT. This allows TS-ASR to be…
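The abstract does not say how the target-speaker information enters the RNNT encoder, so the following is only a minimal sketch under stated assumptions. It assumes a fixed speaker vector obtained by averaging enrollment frames, and an element-wise multiplicative fusion with the encoder's hidden frames, which is one common choice in target speech extraction and may differ from what the paper does. The function names (speaker_embedding, condition, stream) and the toy numbers are hypothetical. The point it illustrates is that a frame-wise fusion of this kind gives the same result whether the input arrives all at once or chunk by chunk, which is what makes it compatible with streaming.

```python
from typing import List

Frame = List[float]


def speaker_embedding(enrollment: List[Frame]) -> Frame:
    """Average enrollment frames over time into one fixed-size vector.

    Assumption: mean pooling stands in for a learned speaker encoder.
    """
    n = len(enrollment)
    dim = len(enrollment[0])
    return [sum(frame[d] for frame in enrollment) / n for d in range(dim)]


def condition(hidden: List[Frame], spk: Frame) -> List[Frame]:
    """Scale every hidden frame element-wise by the speaker vector.

    Assumption: multiplicative fusion; the paper may use another scheme.
    """
    return [[h * s for h, s in zip(frame, spk)] for frame in hidden]


def stream(hidden: List[Frame], spk: Frame, chunk_size: int) -> List[Frame]:
    """Apply the same conditioning chunk by chunk, as a streaming encoder would."""
    out: List[Frame] = []
    for start in range(0, len(hidden), chunk_size):
        out.extend(condition(hidden[start:start + chunk_size], spk))
    return out


# Toy data: 2 enrollment frames and 3 mixture frames, each of dimension 2.
enrollment = [[1.0, 0.0], [3.0, 2.0]]
mixture_hidden = [[2.0, 4.0], [1.0, 1.0], [0.5, 6.0]]

spk = speaker_embedding(enrollment)          # computed once, before streaming
offline = condition(mixture_hidden, spk)     # whole utterance at once
streamed = stream(mixture_hidden, spk, chunk_size=2)

print(offline == streamed)
```

Because the speaker vector is computed once from the enrollment audio and the fusion touches each frame independently, this step adds no look-ahead. Any latency in a real system would come from the encoder layers around it, which this sketch does not model.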
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Balanced Selection
