Transformer in action: a comparative study of transformer-based acoustic models for large scale speech recognition applications

Yongqiang Wang; Yangyang Shi; Frank Zhang; Chunyang Wu; Julian Chan; Ching-Feng Yeh; Alex Xiao

arXiv:2010.14665·cs.CL·November 2, 2020·1 cites

TL;DR

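This paper compares transformer-based acoustic models, in particular the streamable Emformer, with LSTM and latency-controlled BLSTM (LCBLSTM) models on industrial-scale speech recognition tasks. It reports 24% to 26% relative word error rate reductions on a low-latency voice assistant task and, on medium-latency video captioning tasks, significant word error rate reductions together with a 2-3 times lower inference real-time factor.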

Contribution

It provides a comprehensive comparison of transformer-based models with LSTMs in industrial speech recognition tasks, highlighting the advantages of Emformer in accuracy and efficiency.

Findings

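1. Emformer achieves a 24% to 26% relative word error rate reduction (WERR) over LSTM on a low-latency voice assistant task.

2. Compared with an LCBLSTM of similar model size and latency, Emformer reduces the inference real-time factor by 2-3 times in medium-latency scenarios.

3. Transformer-based acoustic models outperform their LSTM counterparts on the industrial-scale speech recognition tasks studied.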

Abstract

In this paper, we summarize the application of the transformer and its streamable variant, Emformer, as acoustic models for large-scale speech recognition applications. We compare the transformer-based acoustic models with their LSTM counterparts on industrial-scale tasks. Specifically, we compare Emformer with latency-controlled BLSTM (LCBLSTM) on medium-latency tasks and with LSTM on low-latency tasks. On a low-latency voice assistant task, Emformer achieves 24% to 26% relative word error rate reductions (WERRs). For medium-latency scenarios, compared with an LCBLSTM of similar model size and latency, Emformer achieves significant WERRs across four languages on video captioning datasets, with a 2-3 times reduction in inference real-time factor.
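The two metrics the abstract relies on, relative WERR and real-time factor, can be illustrated with a minimal sketch. The definitions below are the conventional ones (relative reduction against the baseline WER, and processing time divided by audio duration); the abstract does not spell out its exact formulas, so treat this as an assumption. The function names and all numbers are hypothetical and are not taken from the paper.

```python
def relative_werr(baseline_wer: float, candidate_wer: float) -> float:
    """Relative word error rate reduction, as a percentage of the baseline WER.

    Assumed conventional definition: 100 * (baseline - candidate) / baseline.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - candidate_wer) / baseline_wer


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: time spent decoding divided by audio duration.

    Values below 1.0 mean decoding runs faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical illustration only; these are not results from the paper.
baseline_wer, candidate_wer = 10.0, 7.5
print(f"relative WERR: {relative_werr(baseline_wer, candidate_wer):.1f}%")

baseline_rtf = real_time_factor(processing_seconds=30.0, audio_seconds=60.0)
candidate_rtf = real_time_factor(processing_seconds=12.0, audio_seconds=60.0)
print(f"RTF reduction: {baseline_rtf / candidate_rtf:.1f}x")
```

Under these definitions, a "2-3 times inference real-time factor reduction" corresponds to the baseline RTF being two to three times the candidate RTF on the same audio.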

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory