Spatial Processing Front-End For Distant ASR Exploiting Self-Attention Channel Combinator
Dushyant Sharma, Rong Gong, James Fosburgh, Stanislav Yu. Kruchinin, Patrick A. Naylor, Ljubomir Milanovic

TL;DR
This paper introduces a novel multi-channel front-end for distant ASR that combines channel shortening, self-attention-based channel combination, and beamforming, significantly improving recognition accuracy and dereverberation.
Contribution
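It proposes a front-end that combines WPE channel shortening, a fixed MVDR beamformer, and self-attention-based channel combination (SACC) for distant ASR. Used within a ContextNet-based end-to-end ASR system, it achieves a 21.6% relative WER reduction over leading ASR systems.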
Findings
21.6% relative WER reduction on a multi-channel LibriSpeech playback dataset
13.6 dB improvement in the non-intrusive C50 estimate with 8-channel WPE dereverberation
SACC weights enable accurate spatial information extraction
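For readers unfamiliar with the metric, a relative WER reduction is the drop in word error rate expressed as a percentage of the baseline WER. The sketch below only illustrates the arithmetic; the function name and the WER values are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical values for illustration only (not results from the paper):
# a baseline WER of 10.0% improved to 8.0% is a 20% relative reduction.
example = relative_wer_reduction(10.0, 8.0)
```

A relative figure such as 21.6% therefore does not by itself reveal the absolute WER of either system.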
Abstract
We present a novel multi-channel front-end based on channel shortening with the Weighted Prediction Error (WPE) method followed by a fixed MVDR beamformer used in combination with a recently proposed self-attention-based channel combination (SACC) scheme, for tackling the distant ASR problem. We show that the proposed system, used as part of a ContextNet-based end-to-end (E2E) ASR system, outperforms leading ASR systems, as demonstrated by a 21.6% relative reduction in WER on a multi-channel LibriSpeech playback dataset. We also show how dereverberation prior to beamforming is beneficial and compare the WPE method with a modified neural channel shortening approach. An analysis of the non-intrusive estimate of the signal C50 confirms that the 8-channel WPE method provides significant dereverberation of the signals (13.6 dB improvement). We also show how the weights of the SACC system allow…
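The abstract does not describe the internals of SACC, so the following is only a rough sketch of the general idea of attention-weighted channel combination, not the authors' architecture. It assumes per-channel magnitude spectrograms as input, uses random matrices (`w_q`, `w_k`, `w_v`, all hypothetical names) in place of learned parameters, and combines channels with per-frame softmax weights.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_channel_combine(mag, w_q, w_k, w_v):
    """Combine channels with attention-derived weights (illustrative only).

    mag: (C, T, F) non-negative magnitudes for C channels, T frames, F bins.
    w_q, w_k: (F, D) projections; w_v: (F, 1) projection. These stand in
    for learned parameters and are assumptions of this sketch.
    Returns the combined (T, F) magnitudes and the (T, C) channel weights.
    """
    feats = np.log1p(mag)                      # compress dynamic range
    q = (feats @ w_q).transpose(1, 0, 2)       # (T, C, D)
    k = (feats @ w_k).transpose(1, 0, 2)       # (T, C, D)
    v = (feats @ w_v).transpose(1, 0, 2)       # (T, C, 1)

    # Scaled dot-product attention across channels, separately per frame.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (T, C, C)
    context = softmax(scores, axis=-1) @ v                    # (T, C, 1)

    # One weight per channel and frame; weights sum to 1 over channels.
    weights = softmax(context[..., 0], axis=-1)               # (T, C)
    combined = np.einsum("tc,ctf->tf", weights, mag)          # (T, F)
    return combined, weights


# Toy usage with random data: 8 channels, 5 frames, 16 frequency bins.
rng = np.random.default_rng(0)
C, T, F, D = 8, 5, 16, 4
mag = np.abs(rng.standard_normal((C, T, F)))
combined, weights = attention_channel_combine(
    mag,
    rng.standard_normal((F, D)),
    rng.standard_normal((F, D)),
    rng.standard_normal((F, 1)),
)
```

Because the weights are a softmax over channels, each output frame is a convex combination of the input channels, and the `weights` array can be inspected per frame to see which microphones the combiner favours.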
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Hearing Loss and Rehabilitation
