End-to-End Multi-Channel Transformer for Speech Recognition

Feng-Ju Chang; Martin Radfar; Athanasios Mouchtaris; Brian King; Siegfried Kunzmann

arXiv:2102.03951 · eess.AS · February 9, 2021 · 1 citation

Open Access

TL;DR

This paper introduces a multi-channel transformer architecture that uses attention layers to integrate spectral and spatial information from multiple microphones for speech recognition. On a far-field in-house dataset it outperforms a baseline single-channel transformer as well as super-directive and neural beamformers.

Contribution

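The paper presents a multi-channel transformer for speech recognition built from three kinds of attention layers: channel-wise self attention (CSA) and cross-channel attention (CCA), which encode contextual relationships within and between channels and across time, and multi-channel encoder-decoder attention (EDA), which uses their outputs to decode the next token.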

Findings

01. Outperforms the baseline single-channel transformer

02. Surpasses super-directive and neural beamformers in accuracy

03. Effective in far-field multi-microphone scenarios

Abstract

Transformers are powerful neural architectures that allow integrating different modalities using attention mechanisms. In this paper, we leverage the neural transformer architectures for multi-channel speech recognition systems, where the spectral and spatial information collected from different microphones are integrated using attention layers. Our multi-channel transformer network mainly consists of three parts: channel-wise self attention layers (CSA), cross-channel attention layers (CCA), and multi-channel encoder-decoder attention layers (EDA). The CSA and CCA layers encode the contextual relationship within and between channels and across time, respectively. The channel-attended outputs from CSA and CCA are then fed into the EDA layers to help decode the next token given the preceding ones. The experiments show that in a far-field in-house dataset, our method outperforms the…
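The data flow through the three attention stages named in the abstract can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections, feed-forward layers, or positional encodings. It also assumes that CSA attends over time frames within each channel, that CCA lets each channel query the mean of the other channels, and that EDA attends over the channel-averaged encoder output. All function names and tensor shapes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Scaled dot-product attention; leading axes (if any) are batched.
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ v


def channel_wise_self_attention(x):
    # x: (channels, frames, features). Each channel attends over its own
    # time frames (assumed reading of CSA).
    return attention(x, x, x)


def cross_channel_attention(h):
    # h: (channels, frames, features), requires at least two channels.
    # Assumption: queries come from one channel, keys and values from the
    # mean of all other channels.
    num_channels = h.shape[0]
    others = (h.sum(axis=0, keepdims=True) - h) / (num_channels - 1)
    return attention(h, others, others)


def encoder_decoder_attention(dec, enc):
    # dec: (tokens, features), enc: (channels, frames, features).
    # Assumption: encoder outputs are averaged over channels before the
    # decoder queries attend to them.
    memory = enc.mean(axis=0)
    return attention(dec, memory, memory)


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 5, 8))    # 3 microphones, 5 frames, 8 features
dec = rng.standard_normal((4, 8))     # 4 previously decoded tokens

h = channel_wise_self_attention(x)    # (3, 5, 8)
z = cross_channel_attention(h)        # (3, 5, 8)
out = encoder_decoder_attention(dec, z)  # (4, 8)
```

Each stage preserves the feature dimension, so the channel-attended encoder output can be passed directly to the decoder-side attention. A real model would add learned query, key, and value projections, multiple heads, and residual connections around each layer.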

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing