EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding

Yajie Miao; Mohammad Gowayyed; Florian Metze

arXiv:1507.08240 · cs.CL · October 20, 2015 · 169 cites

TL;DR

Eesen is an end-to-end speech recognition framework that simplifies system building by combining RNN acoustic models with WFST-based decoding, achieving accuracy comparable to hybrid DNN systems with faster decoding.

Contribution

The paper trains a single RNN acoustic model with the CTC objective and pairs it with WFST-based decoding, removing the need for pre-generated frame labels and multiple training stages in ASR system development.

Findings

01 Achieves WERs comparable to hybrid DNN systems.

02 Speeds up decoding significantly.

03 Simplifies ASR system construction.

Abstract

The performance of automatic speech recognition (ASR) has improved tremendously due to the application of deep neural networks (DNNs). Despite this progress, building a new ASR system remains a challenging task, requiring various resources, multiple training stages and significant expertise. This paper presents our Eesen framework which drastically simplifies the existing pipeline to build state-of-the-art ASR systems. Acoustic modeling in Eesen involves learning a single recurrent neural network (RNN) predicting context-independent targets (phonemes or characters). To remove the need for pre-generated frame labels, we adopt the connectionist temporal classification (CTC) objective function to infer the alignments between speech and label sequences. A distinctive feature of Eesen is a generalized decoding approach based on weighted finite-state transducers (WFSTs), which enables the…
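
The abstract notes that CTC infers alignments between speech frames and label sequences instead of relying on pre-generated frame labels. As a rough illustration of that idea, the sketch below follows the standard textbook CTC formulation, not code from the Eesen toolkit. The function names `ctc_collapse` and `ctc_label_prob`, the choice of index 0 as the blank symbol, and the toy posterior matrix are all assumptions made for this example.

```python
import numpy as np

BLANK = 0  # assumption: index 0 is the CTC blank symbol


def ctc_collapse(path):
    """Map a frame-level path to labels: merge repeats, then drop blanks."""
    out = []
    prev = None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out


def ctc_label_prob(probs, labels):
    """Probability of `labels` given per-frame posteriors `probs` (T x K).

    Sums over every frame-level path that collapses to `labels`,
    using the CTC forward recursion.
    """
    T = probs.shape[0]
    # Extended sequence: blank, l1, blank, l2, ..., blank
    ext = [BLANK]
    for label in labels:
        ext += [label, BLANK]
    S = len(ext)

    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]              # stay on the same symbol
            if s >= 1:
                a += alpha[t - 1, s - 1]     # advance by one position
            # Skip over a blank only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]

    # Valid paths end on the last label or on the trailing blank.
    total = alpha[T - 1, S - 1]
    if S > 1:
        total += alpha[T - 1, S - 2]
    return total


# Toy posteriors for 3 frames over {blank, label 1, label 2} (made-up values).
toy_probs = np.array([[0.2, 0.5, 0.3],
                      [0.6, 0.1, 0.3],
                      [0.3, 0.3, 0.4]])
print(ctc_collapse([0, 1, 1, 0, 1, 2, 2]))
print(ctc_label_prob(toy_probs, [1, 2]))
```

Training with CTC minimizes the negative log of this quantity for the reference label sequence, which is why no frame-level alignment has to be supplied up front.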

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing