SNRi Target Training for Joint Speech Enhancement and Recognition

Yuma Koizumi; Shigeki Karita; Arun Narayanan; Sankaran Panchapagesan; Michiel Bacchiani

arXiv:2111.00764 · eess.AS · March 29, 2022 · 1 citation

Open Access

TL;DR

This paper introduces SNRi target training, a method to adapt speech enhancement to different noise conditions by controlling the signal-to-noise ratio improvement, thereby improving speech recognition accuracy.

Contribution

The paper proposes a novel joint training approach that controls the enhancement level via SNRi targets, optimizing noise reduction for diverse noise environments.

Findings

01 Reduces word error rate by up to 5.7% (relative)

02 Enables control of the output SNRi based on noise characteristics

03 Improves robustness of speech recognition in noisy conditions

Abstract

Speech enhancement (SE) is used as a frontend in speech applications including automatic speech recognition (ASR) and telecommunication. A difficulty in using the SE frontend is that the appropriate noise reduction level differs depending on applications and/or noise characteristics. In this study, we propose "signal-to-noise ratio improvement (SNRi) target training"; the SE frontend is trained to output a signal whose SNRi is controlled by an auxiliary scalar input. In joint training with a backend, the target SNRi value is estimated by an auxiliary network. By training all networks to minimize the backend task loss, we can estimate the appropriate noise reduction level for each noisy input in a data-driven scheme. Our experiments showed that the SNRi target training enables control of the output SNRi. In addition, the proposed joint training relatively reduces word error rate by 4.0%…
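To make the quantity in the abstract concrete, the following is a minimal sketch of how SNRi (output SNR minus input SNR, in dB) can be computed for a single utterance, and of one possible penalty on the gap between the achieved SNRi and a target value. The function names and the absolute-difference penalty are assumptions made for illustration; the abstract does not specify the loss the authors actually use, and no network is modeled here.

```python
import numpy as np


def snr_db(clean, estimate, eps=1e-8):
    """SNR in dB of `estimate`, treating (estimate - clean) as the noise."""
    residual = estimate - clean
    signal_power = np.sum(clean ** 2) + eps
    noise_power = np.sum(residual ** 2) + eps
    return 10.0 * np.log10(signal_power / noise_power)


def snr_improvement_db(clean, noisy, enhanced):
    """SNRi: SNR of the enhanced signal minus SNR of the noisy input."""
    return snr_db(clean, enhanced) - snr_db(clean, noisy)


def snri_target_penalty(clean, noisy, enhanced, target_snri_db):
    """Hypothetical penalty: distance between achieved and target SNRi.

    This absolute-difference form is an assumption for illustration,
    not the loss defined in the paper.
    """
    achieved = snr_improvement_db(clean, noisy, enhanced)
    return abs(achieved - target_snri_db)


# Synthetic example: white noise stands in for both speech and noise.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
noisy = clean + noise
# Stand-in for an SE frontend output: noise amplitude scaled by 0.1.
enhanced = clean + 0.1 * noise

achieved_snri = snr_improvement_db(clean, noisy, enhanced)
penalty = snri_target_penalty(clean, noisy, enhanced, target_snri_db=10.0)
print(f"SNRi: {achieved_snri:.2f} dB, penalty vs. 10 dB target: {penalty:.2f}")
```

Scaling the noise amplitude by 0.1 reduces the noise power by a factor of 100, so the achieved SNRi in this synthetic case is approximately 20 dB regardless of the input SNR. In the joint training described in the abstract, the target value would come from an auxiliary network rather than being fixed by hand as it is here.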

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing