Benchmarking LF-MMI, CTC and RNN-T Criteria for Streaming ASR
Xiaohui Zhang, Frank Zhang, Chunxi Liu, Kjell Schubert, Julian Chan, Pradyot Prakash, Jun Liu, Ching-Feng Yeh, Fuchun Peng, Yatharth Saraf, Geoffrey Zweig

TL;DR
This paper benchmarks the LF-MMI, CTC, and RNN-T training criteria for latency-controlled streaming ASR across 7 languages, comparing accuracy and efficiency on social media video data with 3K-14K hours of training data.
Contribution
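To the authors' knowledge, it is the first large-scale, multi-language benchmark comparing these three training criteria for streaming ASR under controlled conditions (identical datasets and encoder architecture), with a selective look at modeling strategies for each criterion.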
Findings
RNN-T achieves the highest ASR accuracy.
CTC models are more efficient during inference.
Modeling strategies, including modeling units, encoder architectures, and pre-training, are examined selectively for the different training criteria.
Abstract
In this work, to measure the accuracy and efficiency for a latency-controlled streaming automatic speech recognition (ASR) application, we perform comprehensive evaluations on three popular training criteria: LF-MMI, CTC and RNN-T. In transcribing social media videos of 7 languages with training data 3K-14K hours, we conduct large-scale controlled experimentation across each criterion using identical datasets and encoder model architecture. We find that RNN-T has consistent wins in ASR accuracy, while CTC models excel at inference efficiency. Moreover, we selectively examine various modeling strategies for different training criteria, including modeling units, encoder architectures, pre-training, etc. Given such large-scale real-world streaming ASR application, to our best knowledge, we present the first comprehensive benchmark on these three widely used training criteria across a great…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
