Cleanformer: A multichannel array configuration-invariant neural enhancement frontend for ASR in smart speakers
Joseph Caroselli, Arun Narayanan, Nathan Howard, Tom O'Malley

TL;DR
Cleanformer is a multichannel neural enhancement frontend for ASR in smart speakers, using self-attention and adaptive noise cancellation to significantly reduce word error rates across various microphone configurations.
Contribution
The paper introduces Cleanformer, a novel array configuration-invariant neural enhancement model that improves ASR performance without retraining for different microphone setups.
Findings
Significant WER reduction (up to 80%) at -6 dB SNR.
Outperforms a beamformer with ideal steering.
Performance improves with more microphones, up to 4.
Abstract
This work introduces the Cleanformer, a streaming multichannel neural-based enhancement frontend for automatic speech recognition (ASR). The model has a conformer-based architecture that takes as input a single channel each of the raw and enhanced signals, and uses self-attention to derive a time-frequency mask. The enhanced input is generated by a multichannel adaptive noise cancellation algorithm known as Speech Cleaner, which uses noise context to derive its filter taps. The time-frequency mask is applied to the noisy input to produce enhanced output features for ASR. Detailed evaluations on simulated and re-recorded datasets, in both speech-based and non-speech-based noise, show a significant reduction in word error rate (WER) when using a large-scale state-of-the-art ASR model. The model is also shown to significantly outperform enhancement using a beamformer with…
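The masking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say in which feature domain the mask is applied or how it is combined with the input, so this example assumes an element-wise product between a mask with values in [0, 1] and non-negative magnitude features. The function name `apply_mask` and all numeric values are hypothetical.

```python
def apply_mask(noisy_features, mask):
    """Multiply a time-frequency mask into noisy features, element by element.

    Both arguments are lists of frames, each frame a list of per-frequency-bin
    values. Assumption (not from the paper): the mask lies in [0, 1] and is
    applied multiplicatively to magnitude-like features.
    """
    if len(noisy_features) != len(mask):
        raise ValueError("frame count mismatch between features and mask")
    enhanced = []
    for frame, mask_row in zip(noisy_features, mask):
        if len(frame) != len(mask_row):
            raise ValueError("frequency-bin count mismatch within a frame")
        enhanced.append([m * x for m, x in zip(mask_row, frame)])
    return enhanced


# Toy input: 2 frames x 3 frequency bins, with made-up values.
noisy = [[1.0, 2.0, 4.0], [0.5, 8.0, 2.0]]
# A mask value near 1 keeps a bin; a value near 0 suppresses it.
mask = [[1.0, 0.5, 0.0], [0.0, 0.25, 1.0]]

enhanced = apply_mask(noisy, mask)
```

In the model itself the mask is predicted by the conformer from the raw and Speech Cleaner channels; here it is simply given as input to keep the sketch self-contained.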
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Adaptive Filtering Techniques
