Streaming Transformer Transducer Based Speech Recognition Using Non-Causal Convolution
Yangyang Shi, Chunyang Wu, Dilin Wang, Alex Xiao, Jay Mahadeokar, Xiaohui Zhang, Chunxi Liu, Ke Li, Yuan Shangguan, Varun Nagaraja, Ozlem Kalinli, Mike Seltzer

TL;DR
This paper improves streaming transformer transducer speech recognition by combining non-causal convolution over the lookahead context with talking-head attention and history context compression, yielding better accuracy at similar latency.
Contribution
The paper introduces non-causal convolution that makes use of the lookahead context, applies talking-head attention, and proposes a novel history context compression scheme to further improve recognition performance.
Findings
Achieved relative word error rate reductions (WERR) of 5.1%, 14.5%, and 8.4% across three evaluation scenarios.
Improved accuracy with similar latency compared to causal convolution.
Enhanced model performance using talking-head attention and history context compression.
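For readers unfamiliar with the metric, relative WERR is conventionally the drop in word error rate expressed as a fraction of the baseline's word error rate. The snippet below is a minimal sketch of that arithmetic only. The function name `relative_werr` and the example WER values are hypothetical and are not taken from the paper.

```python
def relative_werr(baseline_wer: float, new_wer: float) -> float:
    """Return the relative word error rate reduction, in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers chosen only to illustrate the formula:
# a baseline WER of 10.0% improved to 9.0% is a 10% relative reduction.
example = relative_werr(baseline_wer=10.0, new_wer=9.0)
print(f"relative WERR: {example:.1f}%")
```

A relative figure like this says nothing about the absolute error rates, so the same WERR can correspond to very different absolute gains.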
Abstract
This paper improves the streaming transformer transducer for speech recognition by using non-causal convolution. Many works apply the causal convolution to improve streaming transformer ignoring the lookahead context. We propose to use non-causal convolution to process the center block and lookahead context separately. This method leverages the lookahead context in convolution and maintains similar training and decoding efficiency. Given the similar latency, using the non-causal convolution with lookahead context gives better accuracy than causal convolution, especially for open-domain dictation scenarios. Besides, this paper applies talking-head attention and a novel history context compression scheme to further improve the performance. The talking-head attention improves the multi-head self-attention by transferring information among different heads. The history context compression…
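The distinction between causal and non-causal convolution that the abstract relies on can be illustrated with a generic sketch. This is not the paper's block-wise scheme for the center block and lookahead context. It only shows, on a toy 1-D sequence, that a causal kernel sees past frames while a non-causal kernel also sees future ones. The helper `conv1d` and its `left`/`right` parameters are hypothetical names introduced for this illustration.

```python
def conv1d(frames, kernel, left, right):
    """Zero-padded 1-D convolution over a list of numbers.

    Output t is sum_j kernel[j] * frames[t - left + j], so each output
    sees `left` past frames and `right` future frames.
    """
    if len(kernel) != left + right + 1:
        raise ValueError("kernel length must equal left + right + 1")
    padded = [0.0] * left + list(frames) + [0.0] * right
    return [
        sum(w * padded[t + j] for j, w in enumerate(kernel))
        for t in range(len(frames))
    ]


kernel = [1.0, 1.0, 1.0]
frames = [1.0, 2.0, 3.0, 4.0]

# Causal: all context comes from the past (left padding only).
causal = conv1d(frames, kernel, left=2, right=0)
# Non-causal: one past frame and one future (lookahead) frame.
non_causal = conv1d(frames, kernel, left=1, right=1)

# Changing only the last frame leaves earlier causal outputs untouched,
# but alters the non-causal output at the preceding position.
changed = [1.0, 2.0, 3.0, 40.0]
causal_changed = conv1d(changed, kernel, left=2, right=0)
non_causal_changed = conv1d(changed, kernel, left=1, right=1)

print(causal[:3] == causal_changed[:3])    # earlier causal outputs agree
print(non_causal[2] != non_causal_changed[2])  # non-causal output shifts
```

In a streaming setting, that dependence on future frames is why a non-causal kernel needs lookahead frames to be available before it can produce an output.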
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Data Compression Techniques
Methods: Causal Convolution · Convolution
