Learning to Enhance or Not: Neural Network-Based Switching of Enhanced and Observed Signals for Overlapping Speech Recognition
Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Naoyuki Kamo, Takafumi Moriya

TL;DR
This paper introduces a neural network-based method to dynamically switch or combine enhanced and observed speech signals to improve overlapping speech recognition accuracy, outperforming heuristic approaches.
Contribution
It proposes a DNN-based switching and soft-switching approach that adaptively selects or combines signals for better ASR performance, advancing beyond heuristic methods.
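The difference between switching and soft-switching can be illustrated with a minimal sketch. It assumes the switching model outputs a single scalar per utterance (named `p_enhanced` or `weight` here); the function names, the 0.5 threshold, and the linear interpolation form are illustrative assumptions, not details confirmed by this summary.

```python
import numpy as np


def hard_switch(observed, enhanced, p_enhanced, threshold=0.5):
    """Pick one signal for ASR.

    p_enhanced is a hypothetical model output: the estimated probability
    that ASR performs better on the enhanced signal than on the observed one.
    The threshold value is an assumption for illustration.
    """
    return enhanced if p_enhanced >= threshold else observed


def soft_switch(observed, enhanced, weight):
    """Combine both signals as a weighted sum.

    weight=1.0 yields the enhanced signal, weight=0.0 the observed signal.
    A scalar weight per utterance is an assumption of this sketch.
    """
    observed = np.asarray(observed, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    if observed.shape != enhanced.shape:
        raise ValueError("observed and enhanced must have the same shape")
    w = float(np.clip(weight, 0.0, 1.0))  # keep the weight in [0, 1]
    return w * enhanced + (1.0 - w) * observed


if __name__ == "__main__":
    obs = np.array([1.0, 2.0, 3.0])   # toy stand-in for the observed mixture
    enh = np.array([3.0, 4.0, 5.0])   # toy stand-in for the SE output
    print(hard_switch(obs, enh, p_enhanced=0.8))
    print(soft_switch(obs, enh, weight=0.25))
```

Hard switching commits to one input, so a wrong decision costs the full gap between the two signals; the weighted sum lets the ASR input retain part of the observed signal when enhancement artifacts are suspected.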
Findings
Soft-switching achieved up to 23% relative CER reduction.
DNN-based switching performed comparably to oracle rule-based switching.
The method effectively mitigates enhancement artifacts degrading ASR.
Abstract
The combination of a deep neural network (DNN)-based speech enhancement (SE) front-end and an automatic speech recognition (ASR) back-end is a widely used approach to overlapping speech recognition. However, the SE front-end generates processing artifacts that can degrade ASR performance. We previously found that such performance degradation can occur even under fully overlapping conditions, depending on the signal-to-interference ratio (SIR) and signal-to-noise ratio (SNR). To mitigate the degradation, we introduced a rule-based method to switch the ASR input between the enhanced and observed signals, which showed promising results. However, the rule's optimality was unclear because it was heuristically designed and based only on SIR and SNR values. In this work, we propose a DNN-based switching method that directly estimates whether ASR will perform better on the…
