Multi-Channel Transformer Transducer for Speech Recognition

Feng-Ju Chang; Martin Radfar; Athanasios Mouchtaris; Maurizio Omologo

arXiv:2108.12953 · eess.AS · August 31, 2021

TL;DR

This paper introduces the Multi-Channel Transformer Transducer (MCTT), a low-complexity, end-to-end model for multi-channel speech recognition that achieves significant accuracy improvements and faster inference suitable for on-device streaming applications.

Contribution

The paper proposes MCTT, a novel multi-channel transformer transducer model with reduced computational cost and latency, enabling effective on-device streaming speech recognition.

Findings

01. MCTT achieves up to 6.01% relative WER improvement (WERR) over stagewise multi-channel models.

02. MCTT outperforms the multi-channel transformer by up to 11.62% WERR.

03. MCTT is 15.8 times faster in inference speed.
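
The findings above are reported as relative WER improvement (WERR). As a minimal sketch, assuming the standard definition (the baseline's WER minus the model's WER, divided by the baseline's WER), the arithmetic looks like this. The function name `werr` and the WER values are hypothetical and chosen only for illustration; this page does not give the paper's absolute WERs.

```python
def werr(baseline_wer: float, model_wer: float) -> float:
    """Relative WER improvement (WERR), in percent, of a model over a baseline.

    Assumes the standard definition: 100 * (baseline - model) / baseline.
    A positive value means the model has a lower WER than the baseline.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - model_wer) / baseline_wer


# Hypothetical WERs (in percent), not taken from the paper.
baseline = 10.0
model = 9.4
print(f"WERR: {werr(baseline, model):.2f}%")
```

Because WERR is relative, the same percentage can correspond to very different absolute WER gaps depending on the baseline's error rate.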

Abstract

Multi-channel inputs offer several advantages over single-channel inputs for improving the robustness of on-device speech recognition systems. Recent work on the multi-channel transformer has proposed a way to incorporate such inputs into end-to-end ASR for improved accuracy. However, this approach has high computational complexity, which prevents it from being deployed in on-device systems. In this paper, we present a novel speech recognition model, the Multi-Channel Transformer Transducer (MCTT), which features end-to-end multi-channel training, low computation cost, and low latency, making it suitable for streaming decoding in on-device speech recognition. On a far-field in-house dataset, MCTT outperforms stagewise multi-channel models with transformer-transducer by up to 6.01% relative WER improvement (WERR). In addition, MCTT outperforms the multi-channel transformer up to…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Dropout · Layer Normalization · Dense Connections · Byte Pair Encoding · Softmax