Recent Advances in End-to-End Automatic Speech Recognition
Jinyu Li

TL;DR
This paper reviews recent progress in end-to-end speech recognition models, highlighting technological advances that address industry challenges and comparing them to traditional hybrid systems.
Contribution
It provides an overview of recent E2E ASR developments focusing on practical industry challenges and solutions.
Findings
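E2E models achieve state-of-the-art accuracy on most benchmarks.
Hybrid models are still used in a large proportion of commercial ASR systems, because decades of production optimization make them strong on the practical factors that drive deployment decisions.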
Recent advances aim to bridge the gap between E2E performance and industry deployment.
Abstract
Recently, the speech community is seeing a significant trend of moving from deep neural network based hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR). While E2E models achieve the state-of-the-art results in most benchmarks in terms of ASR accuracy, hybrid models are still used in a large proportion of commercial ASR systems at the current time. There are lots of practical factors that affect the production model deployment decision. Traditional hybrid models, being optimized for production for decades, are usually good at these factors. Without providing excellent solutions to all these factors, it is hard for E2E models to be widely commercialized. In this paper, we will overview the recent advances in E2E models, focusing on technologies addressing those challenges from the industry's perspective.
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
