Streaming Transformer Transducer Based Speech Recognition Using Non-Causal Convolution

Yangyang Shi; Chunyang Wu; Dilin Wang; Alex Xiao; Jay Mahadeokar; Xiaohui Zhang; Chunxi Liu; Ke Li; Yuan Shangguan; Varun Nagaraja; Ozlem Kalinli; Mike Seltzer

arXiv:2110.05241 · eess.AS · October 12, 2021 · 1 citation

TL;DR

This paper enhances streaming transformer transducer speech recognition by combining non-causal convolution over the lookahead context with talking-head attention and history context compression, improving accuracy at similar latency.

Contribution

It introduces non-causal convolution to make better use of the lookahead context, and applies talking-head attention together with a novel history context compression scheme to further improve recognition performance.

Findings

01. Achieved relative word error rate reductions (WERR) of 5.1%, 14.5%, and 8.4% across three evaluation scenarios.

02. Non-causal convolution with lookahead context improved accuracy over causal convolution at similar latency.

03. Talking-head attention and history context compression further improved model performance.

Abstract

This paper improves the streaming transformer transducer for speech recognition by using non-causal convolution. Many works apply causal convolution to improve the streaming transformer, ignoring the lookahead context. We propose to use non-causal convolution to process the center block and the lookahead context separately. This method leverages the lookahead context in convolution and maintains similar training and decoding efficiency. At similar latency, non-causal convolution with lookahead context gives better accuracy than causal convolution, especially for open-domain dictation scenarios. In addition, this paper applies talking-head attention and a novel history context compression scheme to further improve performance. Talking-head attention improves multi-head self-attention by transferring information among different heads. The history context compression…
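The contrast between causal and non-causal convolution described in the abstract can be illustrated with a minimal NumPy sketch. This is one plausible reading under stated assumptions, not the paper's implementation: the function names, the depthwise kernel, the odd kernel size, and the zero padding on the history side are all hypothetical choices made for illustration. The center block borrows its right-hand context from the lookahead frames, while the lookahead frames are convolved in a separate pass.

```python
import numpy as np


def depthwise_conv1d(frames, kernel):
    """'Valid' depthwise 1-D convolution over time (no kernel flip).

    frames: (T, C) array, kernel: (K, C) array -> (T - K + 1, C) array.
    """
    k = kernel.shape[0]
    t_out = frames.shape[0] - k + 1
    return np.stack([(frames[t:t + k] * kernel).sum(axis=0) for t in range(t_out)])


def causal_center(center, kernel):
    """Causal baseline: each output frame sees only current and past frames."""
    k, c = kernel.shape
    left = np.zeros((k - 1, c))  # zero history (assumption for this sketch)
    return depthwise_conv1d(np.concatenate([left, center]), kernel)


def non_causal_center(center, lookahead, kernel):
    """Symmetric convolution on the center block.

    The right-hand context comes from the lookahead frames instead of zeros.
    Assumes an odd kernel size and at least (K - 1) // 2 lookahead frames.
    """
    k, c = kernel.shape
    half = (k - 1) // 2
    assert lookahead.shape[0] >= half
    left = np.zeros((half, c))  # history handling omitted in this sketch
    right = lookahead[:half]
    return depthwise_conv1d(np.concatenate([left, center, right]), kernel)


def non_causal_lookahead(center, lookahead, kernel):
    """Separate pass over the lookahead frames.

    Left context is taken from the end of the center block; the right side
    is zero-padded because no further future frames are available.
    """
    k, c = kernel.shape
    half = (k - 1) // 2
    assert center.shape[0] >= half
    left = center[center.shape[0] - half:]
    right = np.zeros((half, c))
    return depthwise_conv1d(np.concatenate([left, lookahead, right]), kernel)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    center = rng.normal(size=(8, 4))     # 8 center frames, 4 channels
    lookahead = rng.normal(size=(4, 4))  # 4 lookahead frames
    kernel = rng.normal(size=(5, 4))     # kernel size 5, one filter per channel
    print(causal_center(center, kernel).shape)
    print(non_causal_center(center, lookahead, kernel).shape)
    print(non_causal_lookahead(center, lookahead, kernel).shape)
```

In this sketch only the last (K - 1) // 2 center frames depend on the lookahead, and no frame beyond the lookahead is ever read, which is consistent with the abstract's claim that latency stays similar to the causal case.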

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Data Compression Techniques

Methods: Causal Convolution · Convolution