End-to-End Multi-Channel Transformer for Speech Recognition
Feng-Ju Chang, Martin Radfar, Athanasios Mouchtaris, Brian King, and Siegfried Kunzmann

TL;DR
This paper introduces a multi-channel transformer architecture that effectively integrates spectral and spatial information from multiple microphones for improved speech recognition, outperforming traditional single-channel and beamforming methods.
Contribution
The paper presents a novel multi-channel transformer model with specialized attention layers for encoding inter-channel and temporal relationships in speech recognition.
Findings
Outperforms baseline single-channel transformer
Surpasses super-directive and neural beamformers in accuracy
Effective in far-field multi-microphone scenarios
Abstract
Transformers are powerful neural architectures that allow integrating different modalities using attention mechanisms. In this paper, we leverage the neural transformer architectures for multi-channel speech recognition systems, where the spectral and spatial information collected from different microphones is integrated using attention layers. Our multi-channel transformer network mainly consists of three parts: channel-wise self attention layers (CSA), cross-channel attention layers (CCA), and multi-channel encoder-decoder attention layers (EDA). The CSA and CCA layers encode the contextual relationship within and between channels and across time, respectively. The channel-attended outputs from CSA and CCA are then fed into the EDA layers to help decode the next token given the preceding ones. The experiments show that in a far-field in-house dataset, our method outperforms the…
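The three attention stages named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It omits learned projections, multiple heads, positional encodings, residual connections, and feed-forward sublayers, and all function names are hypothetical. How the "other" channels are combined in CCA (a plain mean here) and how the per-channel contexts are merged in EDA (also a mean) are assumptions made only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention. q: (Tq, d); k, v: (Tk, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def channel_wise_self_attention(x):
    """CSA sketch. x: (C, T, d). Each channel attends over its own frames."""
    return np.stack([attention(xc, xc, xc) for xc in x])

def cross_channel_attention(h):
    """CCA sketch. h: (C, T, d), C >= 2.

    Channel i supplies the queries; keys and values come from the
    mean of the remaining channels (an assumption, not from the source).
    """
    out = []
    for i in range(h.shape[0]):
        others = np.delete(h, i, axis=0).mean(axis=0)  # (T, d)
        out.append(attention(h[i], others, others))
    return np.stack(out)

def encoder_decoder_attention(dec, enc):
    """EDA sketch. dec: (U, d) decoder states; enc: (C, T, d).

    Decoder states attend to each channel separately and the
    per-channel contexts are averaged (an assumption).
    """
    return np.mean([attention(dec, ec, ec) for ec in enc], axis=0)

# Toy input: 2 microphones, 5 frames, 8 features per frame.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 8))
csa = channel_wise_self_attention(x)
cca = cross_channel_attention(csa)
dec = rng.standard_normal((3, 8))  # 3 previously decoded tokens
ctx = encoder_decoder_attention(dec, cca)
print(csa.shape, cca.shape, ctx.shape)
```

In this sketch the channel axis is preserved through CSA and CCA and only collapsed in the EDA step, so the decoder sees one context vector per output token regardless of the number of microphones.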
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
