Rethinking Processing Distortions: Disentangling the Impact of Speech Enhancement Errors on Speech Recognition Performance
Tsubasa Ochiai, Kazuma Iwamoto, Marc Delcroix, Rintaro Ikeshita, Hiroshi Sato, Shoko Araki, Shigeru Katagiri

TL;DR
This paper investigates how speech enhancement (SE) errors, especially artifacts, degrade automatic speech recognition (ASR). It proposes an analysis scheme and two methods that reduce artifact errors, improving ASR performance in noisy conditions.
Contribution
It introduces a novel error decomposition scheme and proposes two approaches, observation adding (OA) post-processing and artifact-boosted signal-to-distortion ratio (AB-SDR) training, that specifically reduce the artifact errors affecting ASR.
Findings
Artifact errors are more detrimental to ASR than interference or noise.
OA post-processing improves the signal-to-artifact ratio monotonically.
AB-SDR training reduces artifact errors and enhances ASR accuracy.
Abstract
It is challenging to improve automatic speech recognition (ASR) performance in noisy conditions with a single-channel speech enhancement (SE) front-end. This is generally attributed to the processing distortions caused by the nonlinear processing of single-channel SE front-ends. However, the causes of such degraded ASR performance have not been fully investigated. How to design single-channel SE front-ends in a way that significantly improves ASR performance remains an open research question. In this study, we investigate a signal-level numerical metric that can explain the cause of degradation in ASR performance. To this end, we propose a novel analysis scheme based on the orthogonal projection-based decomposition of SE errors. This scheme manually modifies the ratio of the decomposed interference, noise, and artifact errors, and it enables us to directly evaluate the impact of each…
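The orthogonal projection-based decomposition mentioned in the abstract can be illustrated with a minimal NumPy sketch. It assumes a simplified, zero-delay variant of the classical BSS-eval style decomposition: the enhanced signal is projected onto the span of the reference target, then target plus interference, then target plus interference plus noise, and whatever remains outside that span is treated as the artifact error. The function names (`project`, `decompose`, `sar_db`, `remix`) and the scaling weights are hypothetical and chosen for illustration; the paper's actual formulation may differ, for example by allowing time shifts in the projections.

```python
import numpy as np


def project(y, basis):
    """Orthogonally project y onto the span of the columns of `basis`."""
    coeffs, *_ = np.linalg.lstsq(basis, y, rcond=None)
    return basis @ coeffs


def decompose(estimate, target, interference, noise):
    """Split `estimate` into target, interference, noise and artifact parts.

    Assumption: zero-delay projections onto single reference signals.
    """
    p_s = project(estimate, np.stack([target], axis=1))
    p_si = project(estimate, np.stack([target, interference], axis=1))
    p_sin = project(estimate, np.stack([target, interference, noise], axis=1))
    s_target = p_s
    e_interf = p_si - p_s        # explained by interference beyond target
    e_noise = p_sin - p_si       # explained by noise beyond the two above
    e_artif = estimate - p_sin   # orthogonal to all reference signals
    return s_target, e_interf, e_noise, e_artif


def sar_db(s_target, e_interf, e_noise, e_artif):
    """Signal-to-artifact ratio in dB (BSS-eval style definition)."""
    num = np.sum((s_target + e_interf + e_noise) ** 2)
    den = np.sum(e_artif ** 2)
    return 10.0 * np.log10(num / den)


def remix(parts, weights):
    """Recombine the decomposed parts with hypothetical per-part weights."""
    return sum(w * p for w, p in zip(weights, parts))


# Toy signals standing in for real audio (random data, illustration only).
rng = np.random.default_rng(0)
target, interference, noise = rng.standard_normal((3, 16000))
estimate = (0.9 * target + 0.2 * interference + 0.1 * noise
            + 0.3 * rng.standard_normal(16000))

parts = decompose(estimate, target, interference, noise)
print("SAR (dB):", sar_db(*parts))

# Halve only the artifact component and measure the SAR again.
modified = remix(parts, (1.0, 1.0, 1.0, 0.5))
print("SAR after halving artifacts (dB):",
      sar_db(*decompose(modified, target, interference, noise)))
```

Rescaling one component while leaving the others untouched, as `remix` does, is one way to read the abstract's description of manually modifying the ratio of the decomposed errors: each modified signal could then be passed to an ASR back-end to observe how recognition accuracy responds to that error type alone.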
Taxonomy
Topics: Speech and Audio Processing
