A Universally-Deployable ASR Frontend for Joint Acoustic Echo Cancellation, Speech Enhancement, and Voice Separation

Tom O'Malley; Arun Narayanan; Quan Wang

arXiv:2209.06410 · eess.AS · September 15, 2022

TL;DR

This paper introduces a robust, universally-deployable speech frontend model that integrates acoustic echo cancellation, speech enhancement, and voice separation, with novel architectural improvements and strategies for handling missing contextual signals.

Contribution

The paper presents a generalized, improved joint model architecture and Signal Dropout, a strategy that keeps performance robust when contextual signals are missing.

Findings

01 · 25.0% relative WER reduction on background speech without speaker embedding

02 · 61.2% relative WER reduction on AEC without device playback

03 · Model performs nearly as well as task-specific models when signals are missing
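
For readers unfamiliar with the metric, a relative WER reduction is the drop in word error rate expressed as a fraction of the baseline WER. The sketch below shows the arithmetic only. The function name and the input values are hypothetical and chosen for illustration; this page does not report the absolute WERs behind the figures above.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer * 100.0


# Hypothetical values, not taken from the paper:
# a baseline WER of 20.0 improved to 15.0.
example = relative_wer_reduction(20.0, 15.0)
print(f"{example:.1f}% relative WER reduction")
```

Because the measure is relative, the same percentage can correspond to very different absolute improvements depending on the baseline.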

Abstract

Recent work has shown that it is possible to train a single model to perform joint acoustic echo cancellation (AEC), speech enhancement, and voice separation, thereby serving as a unified frontend for robust automatic speech recognition (ASR). The joint model uses contextual information, such as a reference of the playback audio, noise context, and speaker embedding. In this work, we propose a number of novel improvements to such a model. First, we improve the architecture of the Cross-Attention Conformer that is used to ingest noise context into the model. Second, we generalize the model to be able to handle varying lengths of noise context. Third, we propose Signal Dropout, a novel strategy that models missing contextual information. In the absence of one or more signals, the proposed model performs nearly as well as task-specific models trained without these signals; and when such…
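
The abstract describes Signal Dropout only as a strategy that models missing contextual information. It does not say how a missing signal is represented or how often signals are dropped. The sketch below is one plausible reading, not the authors' implementation. It assumes that each contextual signal is dropped independently during training and that a dropped signal is replaced by zeros. The function name, the dictionary keys, and the `drop_prob` parameter are all hypothetical.

```python
import random
from typing import Dict, List


def signal_dropout(
    context: Dict[str, List[float]],
    drop_prob: float,
    rng: random.Random,
) -> Dict[str, List[float]]:
    """Return a copy of `context` with some signals replaced by zeros.

    Assumption (not from the paper): each signal is dropped independently
    with probability `drop_prob`, and a dropped signal becomes all zeros
    of the same length.
    """
    dropped = {}
    for name, values in context.items():
        if rng.random() < drop_prob:
            dropped[name] = [0.0] * len(values)  # simulate a missing signal
        else:
            dropped[name] = list(values)  # keep the signal unchanged
    return dropped


# Toy contextual inputs named after the signals listed in the abstract.
context = {
    "playback_reference": [0.1, 0.2],
    "noise_context": [0.3],
    "speaker_embedding": [0.5, 0.6, 0.7],
}
rng = random.Random(0)
print(signal_dropout(context, drop_prob=0.5, rng=rng))
```

Under this reading, the model sees examples with every combination of present and absent signals during training, which is consistent with the abstract's claim that a single model stays close to task-specific models when signals are missing.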

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Adaptive Filtering Techniques

Methods: Dropout