EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding
Yajie Miao, Mohammad Gowayyed, Florian Metze

TL;DR
The paper introduces Eesen, an end-to-end speech recognition framework that simplifies system building by pairing a CTC-trained RNN acoustic model with WFST-based decoding, reaching accuracy comparable to hybrid DNN systems with faster decoding.
Contribution
The paper trains a single RNN acoustic model with the CTC objective on context-independent targets (phonemes or characters) and decodes with WFSTs, removing the pre-generated frame labels and multiple training stages that conventional ASR pipelines require.
Findings
Achieves WERs comparable to those of hybrid DNN systems.
Significantly speeds up decoding.
Simplifies ASR system construction.
Abstract
The performance of automatic speech recognition (ASR) has improved tremendously due to the application of deep neural networks (DNNs). Despite this progress, building a new ASR system remains a challenging task, requiring various resources, multiple training stages and significant expertise. This paper presents our Eesen framework which drastically simplifies the existing pipeline to build state-of-the-art ASR systems. Acoustic modeling in Eesen involves learning a single recurrent neural network (RNN) predicting context-independent targets (phonemes or characters). To remove the need for pre-generated frame labels, we adopt the connectionist temporal classification (CTC) objective function to infer the alignments between speech and label sequences. A distinctive feature of Eesen is a generalized decoding approach based on weighted finite-state transducers (WFSTs), which enables the…
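The role of the CTC objective mentioned in the abstract can be illustrated with a small sketch. The code below is not taken from Eesen; it is a minimal pure-Python illustration, assuming the blank symbol has index 0 and that `probs[t][k]` holds the network's per-frame posterior for symbol `k`. The names `collapse`, `ctc_likelihood` and `brute_force` are hypothetical. It shows how CTC sums over all frame-level alignments of a label sequence, which is why no pre-generated frame labels are needed.

```python
from itertools import product

BLANK = 0  # assumed index of the CTC blank symbol


def collapse(path):
    """Map a frame-level path to labels: merge repeats, then drop blanks."""
    out = []
    prev = None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out


def ctc_likelihood(probs, labels):
    """CTC forward algorithm: P(labels | probs) summed over all alignments."""
    if not probs:
        return 1.0 if not labels else 0.0
    # Extended label sequence: a blank before, between and after the labels.
    ext = [BLANK]
    for label in labels:
        ext += [label, BLANK]
    size = len(ext)
    # alpha[s] = probability of all path prefixes ending in ext[s] at frame t.
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]  # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]  # advance by one position
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new
    # Valid paths end on the last label or on the trailing blank.
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


def brute_force(probs, labels):
    """Reference check: enumerate every path and keep those that collapse to labels."""
    num_symbols = len(probs[0])
    total = 0.0
    for path in product(range(num_symbols), repeat=len(probs)):
        if collapse(path) == list(labels):
            p = 1.0
            for t, s in enumerate(path):
                p *= probs[t][s]
            total += p
    return total


if __name__ == "__main__":
    # Three frames, three symbols (blank plus two labels); values are made up.
    probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
    print(ctc_likelihood(probs, [1, 2]), brute_force(probs, [1, 2]))
```

The two printed values should agree up to floating-point error, since the forward recursion is a dynamic-programming form of the exhaustive sum. Training minimizes the negative log of this likelihood; practical implementations typically work in log space to avoid underflow on long utterances.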
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
