A Universally-Deployable ASR Frontend for Joint Acoustic Echo Cancellation, Speech Enhancement, and Voice Separation
Tom O'Malley, Arun Narayanan, Quan Wang

TL;DR
This paper introduces a robust, universally-deployable speech frontend model that integrates acoustic echo cancellation, speech enhancement, and voice separation, with novel architectural improvements and strategies for handling missing contextual signals.
Contribution
The paper presents an improved, generalized joint model architecture and introduces Signal Dropout, so that the model stays robust whether or not each contextual signal is available.
Findings
25.0% relative WER reduction on background speech without speaker embedding
61.2% relative WER reduction on AEC without device playback
Model performs nearly as well as task-specific models when signals are missing
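The findings above are reported as relative WER reductions, that is, the drop in word error rate expressed as a fraction of the baseline WER. A minimal sketch of that arithmetic follows; the function name and the example numbers are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical illustration only: a baseline WER of 20.0 falling to 15.0
# is an absolute reduction of 5.0 points and a relative reduction of 25%.
example = relative_wer_reduction(20.0, 15.0)
```

A relative figure does not reveal the absolute WER on its own, so a 25.0% relative reduction can correspond to very different absolute error rates depending on the baseline.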
Abstract
Recent work has shown that it is possible to train a single model to perform joint acoustic echo cancellation (AEC), speech enhancement, and voice separation, thereby serving as a unified frontend for robust automatic speech recognition (ASR). The joint model uses contextual information, such as a reference of the playback audio, noise context, and speaker embedding. In this work, we propose a number of novel improvements to such a model. First, we improve the architecture of the Cross-Attention Conformer that is used to ingest noise context into the model. Second, we generalize the model to be able to handle varying lengths of noise context. Third, we propose Signal Dropout, a novel strategy that models missing contextual information. In the absence of one or more signals, the proposed model performs nearly as well as task-specific models trained without these signals; and when such…
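The abstract describes Signal Dropout only as a strategy that models missing contextual information. One plausible reading, sketched below, is to randomly blank out each contextual input during training so the model learns to cope without it. The zero-filling, the per-signal independent drop, the function name `signal_dropout`, the dictionary keys, and the array shapes are all assumptions made for illustration and are not confirmed by the source.

```python
import numpy as np


def signal_dropout(context, drop_prob, rng):
    """Randomly replace contextual signals with zeros (assumed mechanism).

    context:   dict mapping a signal name to a NumPy array
    drop_prob: probability of dropping each signal, applied independently
    rng:       a numpy.random.Generator
    """
    out = {}
    for name, value in context.items():
        if rng.random() < drop_prob:
            # Assumption: a missing signal is represented by an all-zero array.
            out[name] = np.zeros_like(value)
        else:
            out[name] = value
    return out


# Hypothetical context signals; names and shapes are illustrative only.
rng = np.random.default_rng(0)
context = {
    "playback_reference": np.ones((100, 128)),
    "noise_context": np.ones((600, 128)),
    "speaker_embedding": np.ones((256,)),
}
augmented = signal_dropout(context, drop_prob=0.5, rng=rng)
```

At inference time, under the same assumption, a signal that is genuinely unavailable would be passed in as zeros, matching what the model saw during training.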
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Adaptive Filtering Techniques
Methods: Dropout
