Improving RNN Transducer With Target Speaker Extraction and Neural Uncertainty Estimation

Jiatong Shi; Chunlei Zhang; Chao Weng; Shinji Watanabe; Meng Yu; Dong Yu

arXiv:2011.13393·cs.SD·March 1, 2021

TL;DR

This paper introduces a joint framework that combines time-domain target-speaker speech extraction with an RNN Transducer (RNN-T), and uses neural uncertainty estimation to improve recognition accuracy on noisy, multi-speaker audio.

Contribution

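It proposes a multi-stage training strategy that pre-trains and fine-tunes each module before joint training, together with speaker identity and speech enhancement uncertainty measures that compensate for residual noise and artifacts from the extraction module.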

Findings

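01  Adding the neural uncertainty module reduces CER by 17% relative on multi-speaker signals with background noise, compared to a recognizer fine-tuned with the target speech extraction model.

02  Achieves a 9% relative performance gain in the noisy condition.

03  Maintains performance in the clean condition.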

Abstract

Target-speaker speech recognition aims to recognize the target speaker's speech in noisy environments with background noise and interfering speakers. This work presents a joint framework that combines time-domain target-speaker speech extraction and a Recurrent Neural Network Transducer (RNN-T). To stabilize joint training, we propose a multi-stage training strategy that pre-trains and fine-tunes each module in the system before joint training. In addition, speaker identity and speech enhancement uncertainty measures are proposed to compensate for residual noise and artifacts from the target speech extraction module. Compared to a recognizer fine-tuned with a target speech extraction model, our experiments show that adding the neural uncertainty module reduces the Character Error Rate (CER) by 17% relative on multi-speaker signals with background noise. The multi-condition…
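
The abstract reports its gain as a relative CER reduction, which is easy to confuse with an absolute difference in CER. The following is a minimal sketch of how the two quantities are conventionally computed, assuming the standard Levenshtein-based definition of CER; the paper's own scoring tool is not described here. The function names (`edit_distance`, `cer`, `relative_reduction`) and the CER values in the usage lines are hypothetical illustrations, not figures from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Fraction of the baseline error rate removed by the improved system."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


# Hypothetical numbers, chosen only to illustrate the arithmetic:
# a baseline CER of 20.0% and an improved CER of 16.6%.
baseline_cer = 0.200
improved_cer = 0.166
rel = relative_reduction(baseline_cer, improved_cer)
print(f"absolute reduction: {baseline_cer - improved_cer:.3f}")
print(f"relative reduction: {rel:.2%}")
```

Under these made-up values the absolute drop is 3.4 CER points while the relative reduction is about 17%, which shows why the two figures should not be read interchangeably.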

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing