Multi-Channel Transformer Transducer for Speech Recognition
Feng-Ju Chang, Martin Radfar, Athanasios Mouchtaris, Maurizio Omologo

TL;DR
This paper introduces the Multi-Channel Transformer Transducer (MCTT), a low-complexity, end-to-end model for multi-channel speech recognition that achieves significant accuracy improvements and faster inference suitable for on-device streaming applications.
Contribution
The paper proposes MCTT, a novel multi-channel transformer transducer model with reduced computational cost and latency, enabling effective on-device streaming speech recognition.
Findings
MCTT achieves up to 6.01% relative WER improvement over stagewise models.
MCTT outperforms the multi-channel transformer by up to 11.62% relative WER improvement (WERR).
MCTT is 15.8 times faster than the multi-channel transformer at inference.
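The findings above are reported as relative WER improvement (WERR). As a minimal sketch, assuming the usual definition (the baseline WER minus the new WER, divided by the baseline WER), the metric can be computed as follows. The function name and the WER values are hypothetical, chosen only for illustration; they are not taken from the paper.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER improvement (WERR) in percent.

    Assumes the common definition: (baseline - new) / baseline * 100.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values for illustration only (not results from the paper).
werr = relative_wer_improvement(baseline_wer=10.0, new_wer=9.0)
print(f"WERR: {werr:.2f}%")
```

A positive value means the new model makes fewer word errors than the baseline; a negative value means it makes more.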
Abstract
Multi-channel inputs offer several advantages over single-channel inputs for improving the robustness of on-device speech recognition systems. Recent work on the multi-channel transformer has proposed a way to incorporate such inputs into end-to-end ASR for improved accuracy. However, this approach has high computational complexity, which prevents it from being deployed in on-device systems. In this paper, we present a novel speech recognition model, the Multi-Channel Transformer Transducer (MCTT), which features end-to-end multi-channel training, low computation cost, and low latency, making it suitable for streaming decoding in on-device speech recognition. On a far-field in-house dataset, our MCTT outperforms stagewise multi-channel models with transformer-transducer by up to 6.01% relative WER improvement (WERR). In addition, MCTT outperforms the multi-channel transformer up to…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Dropout · Layer Normalization · Dense Connections · Byte Pair Encoding · Softmax
