Attention-based Neural Beamforming Layers for Multi-channel Speech Recognition

Bhargav Pulugundla; Yang Gao; Brian King; Gokce Keskin; Harish Mallidi; Minhua Wu; Jasha Droppo; Roland Maas

arXiv:2105.05920 · eess.AS · May 18, 2021 · 1 citation

TL;DR

This paper introduces a 2D Conv-Attention module that combines convolutional neural networks with attention for multi-channel speech recognition, and reports a 3.8% relative WER reduction over the baseline neural beamformer.

Contribution

The paper proposes a 2D Conv-Attention layer that uses self- and cross-attention to explicitly model correlations within and between input channels, strengthening neural beamforming for speech recognition.

Findings

01. Achieved a 3.8% relative WER improvement over the baseline neural beamformer.

02. Demonstrated the effectiveness of combining convolutional and attention mechanisms.

03. Validated on an in-house multi-channel speech dataset.
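
For readers unfamiliar with the metric, a relative WER improvement is the drop in word error rate expressed as a percentage of the baseline WER. The sketch below only illustrates that arithmetic. The function name `relative_wer_reduction` and the two WER values are hypothetical; this page does not report absolute WERs.

```python
def relative_wer_reduction(baseline_wer: float, proposed_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - proposed_wer) / baseline_wer


# Hypothetical WERs chosen only to illustrate the arithmetic;
# they are NOT results from the paper.
baseline = 10.0
proposed = 9.62
print(round(relative_wer_reduction(baseline, proposed), 1))  # 3.8
```

Note that a relative reduction says nothing about the absolute error rates: the same 3.8% could come from a high or a low baseline WER.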

Abstract

Attention-based beamformers have recently been shown to be effective for multi-channel speech recognition. However, they are less capable of capturing local information. In this work, we propose a 2D Conv-Attention module which combines convolutional neural networks with attention for beamforming. We apply self- and cross-attention to explicitly model the correlations within and between the input channels. The end-to-end 2D Conv-Attention model is compared with multi-head self-attention and superdirective-based neural beamformers. We train and evaluate on an in-house multi-channel dataset. The results show a relative improvement of 3.8% in WER by the proposed model over the baseline neural beamformer.
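
The idea in the abstract, local convolution followed by self- and cross-attention over channels, can be sketched roughly as follows. This is a toy NumPy illustration under strong assumptions, not the authors' architecture: there are no learned projections or multiple heads, and the kernel, the shapes, the averaging used to fuse channels, and the names `conv2d_same`, `attention`, and `conv_attention` are all hypothetical.

```python
import numpy as np


def conv2d_same(x, kernel):
    """Naive zero-padded 2D cross-correlation over one (time, freq) map.

    Assumes odd kernel dimensions so the output has the same shape as x.
    """
    kt, kf = kernel.shape
    padded = np.pad(x, ((kt // 2, kt // 2), (kf // 2, kf // 2)))
    out = np.zeros_like(x, dtype=float)
    for i in range(kt):
        for j in range(kf):
            out += kernel[i, j] * padded[i:i + x.shape[0], j:j + x.shape[1]]
    return out


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query, key, value):
    """Scaled dot-product attention across time frames."""
    scores = query @ key.T / np.sqrt(query.shape[-1])
    return softmax(scores) @ value


def conv_attention(channels, kernel):
    """Fuse a (channels, time, freq) array into one (time, freq) map.

    Assumption: channel pairs are fused by simple averaging.
    """
    # Step 1: the convolution captures local time-frequency structure.
    local = np.stack([conv2d_same(ch, kernel) for ch in channels])
    n = len(local)
    fused = np.zeros_like(local[0])
    for i in range(n):
        for j in range(n):
            # i == j: self-attention within a channel.
            # i != j: cross-attention, channel i queries channel j.
            fused += attention(local[i], local[j], local[j])
    return fused / (n * n)


rng = np.random.default_rng(0)
features = rng.standard_normal((2, 50, 16))  # 2 mics, 50 frames, 16 bins (illustrative)
kernel = np.full((3, 3), 1.0 / 9.0)          # fixed 3x3 averaging kernel (illustrative)
out = conv_attention(features, kernel)
print(out.shape)  # (50, 16)
```

In a trained model the kernel and the query, key, and value projections would be learned end to end with the recognizer. Here they are fixed so the data flow stays easy to follow.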

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Convolution