Transformer in action: a comparative study of transformer-based acoustic models for large scale speech recognition applications
Yongqiang Wang, Yangyang Shi, Frank Zhang, Chunyang Wu, Julian Chan, Ching-Feng Yeh, Alex Xiao

TL;DR
This paper compares transformer-based acoustic models, especially Emformer, with traditional LSTM models for large-scale speech recognition, demonstrating significant improvements in word error rates and inference efficiency across various latency scenarios.
Contribution
It provides a comprehensive comparison of transformer-based models with LSTMs in industrial speech recognition tasks, highlighting the advantages of Emformer in accuracy and efficiency.
Findings
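Emformer achieves 24% to 26% relative word error rate reduction (WERR) over LSTM on a low-latency voice assistant task.
Compared with LCBLSTM of similar model size and latency, Emformer reduces the inference real-time factor by 2 to 3 times in medium-latency video captioning scenarios across four languages.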
Transformer models outperform LSTMs in large-scale speech recognition applications.
Abstract
In this paper, we summarize the application of the transformer and its streamable variant, Emformer, as acoustic models for large-scale speech recognition applications. We compare the transformer-based acoustic models with their LSTM counterparts on industrial-scale tasks. Specifically, we compare Emformer with latency-controlled BLSTM (LCBLSTM) on medium-latency tasks and with LSTM on low-latency tasks. On a low-latency voice assistant task, Emformer achieves 24% to 26% relative word error rate reductions (WERRs). For medium-latency scenarios, compared with an LCBLSTM of similar model size and latency, Emformer achieves significant WERRs across four languages on video captioning datasets, with a 2 to 3 times reduction in inference real-time factor.
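The two metrics quoted in the abstract, relative WERR and real-time factor, can be illustrated with a minimal Python sketch. The function names and every number below are hypothetical and chosen only to show the arithmetic; none of them come from the paper, and the definitions used are the conventional ones (relative reduction against a baseline WER, and processing time divided by audio duration), which the abstract itself does not spell out.

```python
def relative_werr(baseline_wer: float, new_wer: float) -> float:
    """Relative word error rate reduction, in percent of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: time spent decoding divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical numbers, not results from the paper.
baseline_wer = 10.0   # assumed baseline WER in percent
candidate_wer = 7.5   # assumed candidate WER in percent
werr = relative_werr(baseline_wer, candidate_wer)

baseline_rtf = real_time_factor(processing_seconds=5.0, audio_seconds=10.0)
candidate_rtf = real_time_factor(processing_seconds=2.5, audio_seconds=10.0)
rtf_reduction = baseline_rtf / candidate_rtf  # "N times" reduction in RTF

print(f"relative WERR: {werr:.1f}%")
print(f"RTF reduction: {rtf_reduction:.1f}x")
```

Under these conventions, a WERR of 25% means the candidate removes a quarter of the baseline's word errors, and a 2 times RTF reduction means the candidate needs half the compute time per second of audio.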
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
