Streaming Target-Speaker ASR with Neural Transducer

Takafumi Moriya; Hiroshi Sato; Tsubasa Ochiai; Marc Delcroix; Takahiro Shinozaki

arXiv:2209.04175·eess.AS·September 20, 2022

TL;DR

This paper introduces a streaming target-speaker ASR (TS-ASR) system that integrates target speech extraction into an end-to-end neural transducer, achieving accuracy comparable to cascade systems in the offline setting while reducing computation and enabling streaming recognition.

Contribution

The paper presents a streaming Conformer-based neural transducer for target-speaker ASR that eliminates the need for a separate speech separation or target speech extraction front-end.

Findings

01 Achieves comparable accuracy to cascade systems in offline mode.

02 Reduces computation costs for streaming target-speaker recognition.

03 Enables real-time, streaming ASR for multi-talker speech.

Abstract

Although recent advances in deep learning technology have boosted automatic speech recognition (ASR) performance in the single-talker case, it remains difficult to recognize multi-talker speech in which many voices overlap. One conventional approach to this problem is to use a cascade of a speech separation or target speech extraction front-end with an ASR back-end. However, the extra computation costs of the front-end module are a critical barrier to quick response, especially for streaming ASR. In this paper, we propose a target-speaker ASR (TS-ASR) system that implicitly integrates the target speech extraction functionality within a streaming end-to-end (E2E) ASR system, i.e., a recurrent neural network-transducer (RNNT). Our system uses an idea similar to that adopted for target speech extraction, but implements it directly at the level of the encoder of RNNT. This allows TS-ASR to be…
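
The idea of conditioning the RNNT encoder on the target speaker can be illustrated with a minimal sketch. It is not the authors' implementation: the element-wise (multiplicative) fusion, the time-averaged enrollment embedding, and the names `speaker_embedding`, `condition_frame`, and `streaming_encode` are all assumptions made for illustration, loosely following a fusion commonly used in target speech extraction. The abstract above does not specify the fusion operation or where in the encoder it is applied.

```python
from typing import Iterable, Iterator, List

Frame = List[float]  # one feature or hidden vector for a single time step


def speaker_embedding(enrollment: List[Frame]) -> Frame:
    """Average enrollment frames over time into one fixed-size speaker vector.

    Assumption: a real system would use a learned speaker encoder;
    plain time-averaging stands in for it here.
    """
    num_frames = len(enrollment)
    dim = len(enrollment[0])
    return [sum(frame[d] for frame in enrollment) / num_frames for d in range(dim)]


def condition_frame(hidden: Frame, spk: Frame) -> Frame:
    """Fuse one encoder frame with the speaker vector by element-wise product.

    Assumption: multiplicative fusion is one plausible choice, not
    necessarily the one used in the paper.
    """
    return [h * s for h, s in zip(hidden, spk)]


def streaming_encode(frames: Iterable[Frame], spk: Frame) -> Iterator[Frame]:
    """Yield speaker-conditioned frames one at a time, as they arrive.

    Each output depends only on the current frame and the fixed speaker
    vector, so no future context is needed and no separate front-end runs.
    """
    for frame in frames:
        yield condition_frame(frame, spk)


if __name__ == "__main__":
    spk = speaker_embedding([[1.0, 2.0], [3.0, 4.0]])
    for out in streaming_encode([[1.0, 1.0], [0.5, 2.0]], spk):
        print(out)
```

Because the speaker vector is computed once from the enrollment audio and reused for every incoming frame, the per-frame cost of conditioning in this sketch is a single element-wise product, which is the kind of saving over a separate extraction front-end that the abstract points to.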

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Balanced Selection