Should We Always Separate?: Switching Between Enhanced and Observed Signals for Overlapping Speech Recognition

Hiroshi Sato; Tsubasa Ochiai; Marc Delcroix; Keisuke Kinoshita; Takafumi Moriya; Naoyuki Kamo

arXiv:2106.00949·eess.AS·June 17, 2022

TL;DR

This paper investigates whether speech separation always benefits automatic speech recognition (ASR) for overlapping speech. It finds that processing artifacts introduced by enhancement can sometimes harm ASR performance, and it proposes a method that switches between the enhanced and observed signals based on signal quality metrics.

Contribution

The paper analyzes the conditions under which speech enhancement degrades ASR and introduces a simple switching algorithm that chooses between the enhanced and observed signals to improve recognition accuracy.

Findings

01

Switching between observed and enhanced speech can improve ASR performance.

02

Speech enhancement may degrade ASR under certain noise and interference conditions.

03

A simple signal-based switching method outperforms always using separation or observed speech alone.
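The switching idea in the findings above can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not say which signal quality metric or decision rule the paper uses, so the signal-to-interference ratio (SIR) criterion, the 10 dB threshold, and the names `sir_db` and `select_asr_input` are all assumptions made for illustration. The sketch also assumes oracle access to the target and interfering signals, which is only available in simulation.

```python
import math
from typing import Sequence, Tuple


def sir_db(target: Sequence[float], interference: Sequence[float]) -> float:
    """Signal-to-interference ratio in dB, computed from signal energies.

    Hypothetical helper: assumes oracle access to the separate target and
    interference signals, which is only possible with simulated mixtures.
    """
    target_energy = sum(x * x for x in target)
    interference_energy = sum(x * x for x in interference)
    if interference_energy == 0.0:
        return math.inf  # no interference at all
    if target_energy == 0.0:
        return -math.inf  # no target energy at all
    return 10.0 * math.log10(target_energy / interference_energy)


def select_asr_input(
    observed: Sequence[float],
    enhanced: Sequence[float],
    input_sir_db: float,
    threshold_db: float = 10.0,  # assumed value, not taken from the paper
) -> Tuple[str, Sequence[float]]:
    """Pick which signal to pass to the ASR back-end.

    Assumed rule: when the mixture is already dominated by the target
    (high SIR), skip enhancement to avoid processing artifacts; otherwise
    use the enhanced signal.
    """
    if input_sir_db >= threshold_db:
        return "observed", observed
    return "enhanced", enhanced
```

In practice the SIR of a real recording is not known in advance, so a deployed switch would need some estimate of signal quality in place of the oracle value used here.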

Abstract

Although recent advances in deep learning technology improved automatic speech recognition (ASR), it remains difficult to recognize speech when it overlaps other people's voices. Speech separation or extraction is often used as a front-end to ASR to handle such overlapping speech. However, deep neural network-based speech enhancement can generate 'processing artifacts' as a side effect of the enhancement, which degrades ASR performance. For example, it is well known that single-channel noise reduction for non-speech noise (non-overlapping speech) often does not improve ASR. Likewise, the processing artifacts may also be detrimental to ASR in some conditions when processing overlapping speech with a separation/extraction method, although it is usually believed that separation/extraction improves ASR. In order to answer the question 'Do we always have to separate/extract speech from…
