Self-Attention Channel Combinator Frontend for End-to-End Multichannel Far-field Speech Recognition

Rong Gong; Carl Quillen; Dushyant Sharma; Andrew Goderre; José Laínez; Ljubomir Milanović

arXiv:2109.04783 · cs.SD · September 13, 2021

Open Access

TL;DR

This paper introduces the self-attention channel combinator (SACC), a multichannel audio frontend for end-to-end speech recognition that uses self-attention to combine channels in the magnitude spectral domain and outperforms a fixed beamformer-based frontend.

Contribution

The paper proposes the SACC frontend, which uses self-attention to combine multichannel signals, demonstrating improved performance over fixed beamformer approaches in end-to-end ASR systems.

Findings

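01

SACC achieved a 9.3% word error rate reduction (WERR) relative to a state-of-the-art fixed beamformer-based frontend.

02

SACC outperforms the fixed beamformer frontend on multichannel playback test data when both are jointly optimized with a ContextNet-based ASR backend.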

03

Analysis shows the connection between SACC and classical beamformers.
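
The WERR quoted in the first finding is a relative measure, not an absolute difference in word error rate. The short Python sketch below shows the usual definition, WERR = (WER_baseline - WER_new) / WER_baseline, expressed as a percentage. The function name `werr` and the WER values in the usage line are hypothetical and chosen only for illustration; the page does not report the underlying absolute WERs.

```python
def werr(wer_baseline: float, wer_new: float) -> float:
    """Relative word error rate reduction, in percent.

    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if wer_baseline <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (wer_baseline - wer_new) / wer_baseline


# Hypothetical numbers, NOT taken from the paper:
# a baseline WER of 10.0% and a new WER of 9.07% give a WERR of about 9.3%.
example = werr(10.0, 9.07)
```

Because the measure is relative, the same 9.3% WERR corresponds to a smaller absolute gain when the baseline WER is already low.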

Abstract

When a sufficiently large amount of far-field training data is available, jointly optimizing a multichannel frontend and an end-to-end (E2E) Automatic Speech Recognition (ASR) backend shows promising results. Recent literature has shown that traditional beamformer designs, such as MVDR (Minimum Variance Distortionless Response) or fixed beamformers, can be successfully integrated as the frontend of an E2E ASR system with learnable parameters. In this work, we propose the self-attention channel combinator (SACC) ASR frontend, which leverages the self-attention mechanism to combine multichannel audio signals in the magnitude spectral domain. Experiments conducted on multichannel playback test data show that the SACC achieved a 9.3% WERR compared to a state-of-the-art fixed beamformer-based frontend, both jointly optimized with a ContextNet-based ASR backend. We also demonstrate the connection…
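
To make the idea of combining channels with self-attention in the magnitude spectral domain more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the architecture from the paper: the log-compressed features, the projection shapes `w_q` and `w_k`, the averaging step that reduces the attention matrix to one weight per channel, and the function name `channel_attention_combine` are all assumptions introduced here. In a trained system the projections would be learned jointly with the ASR backend; random values stand in for them below.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def channel_attention_combine(mag, w_q, w_k):
    """Combine C channels into one magnitude spectrum per frame.

    mag : (T, C, F) non-negative magnitude spectra (frames, channels, bins)
    w_q : (F, D) query projection (assumed; learned in a real system)
    w_k : (F, D) key projection (assumed; learned in a real system)
    Returns (combined, weights) with shapes (T, F) and (T, C).
    """
    feats = np.log(mag + 1e-6)                 # assumed log compression
    q = feats @ w_q                            # (T, C, D)
    k = feats @ w_k                            # (T, C, D)
    d = w_q.shape[1]
    # Scaled dot-product attention across the channel axis, per frame.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (T, C, C)
    attn = softmax(scores, axis=-1)
    # Assumed reduction: average the attention each channel receives,
    # giving one weight per channel that sums to 1 over channels.
    weights = attn.mean(axis=1)                # (T, C)
    # Weighted sum of the channel magnitudes.
    combined = (weights[:, :, None] * mag).sum(axis=1)   # (T, F)
    return combined, weights


# Usage with random stand-in data: 5 frames, 4 channels, 8 frequency bins.
rng = np.random.default_rng(0)
mag = np.abs(rng.standard_normal((5, 4, 8)))
w_q = rng.standard_normal((8, 3))
w_k = rng.standard_normal((8, 3))
combined, weights = channel_attention_combine(mag, w_q, w_k)
```

Because the per-frame weights are non-negative and sum to one, the output in this sketch is a convex combination of the channel magnitudes. A fixed beamformer, by contrast, applies weights that do not change with the input.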

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing