Segmental Recurrent Neural Networks for End-to-end Speech Recognition
Liang Lu, Lingpeng Kong, Chris Dyer, Noah A. Smith, Steve Renals

TL;DR
This paper introduces a self-contained, end-to-end segmental RNN-CRF model for speech recognition that learns its features jointly with the segmental CRF and marginalises over segmentations, achieving 17.3% PER on TIMIT, the best reported result using CRFs, without external features or language models.
Contribution
It presents a novel end-to-end segmental RNN-CRF model that marginalizes segmentation, removing the need for external systems and enabling joint training for speech recognition.
Findings
Achieved 17.3% phone error rate (PER) on TIMIT from first-pass decoding, the best reported result using CRFs.
Model does not rely on external features or language models.
Discusses practical training and decoding issues for speech recognition, along with a method to speed up training.
Abstract
We study the segmental recurrent neural network for end-to-end acoustic modelling. This model connects the segmental conditional random field (CRF) with a recurrent neural network (RNN) used for feature extraction. Compared to most previous CRF-based acoustic models, it does not rely on an external system to provide features or segmentation boundaries. Instead, this model marginalises out all the possible segmentations, and features are extracted from the RNN trained together with the segmental CRF. In essence, this model is self-contained and can be trained end-to-end. In this paper, we discuss practical training and decoding issues as well as the method to speed up the training in the context of speech recognition. We performed experiments on the TIMIT dataset. We achieved a 17.3% phone error rate (PER) from first-pass decoding, the best reported result using CRFs, despite the…
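The marginalisation over segmentations mentioned in the abstract can be illustrated with a small dynamic program. The sketch below is not the authors' implementation. It assumes a hypothetical `score(start, end, label)` function that returns a real-valued score for labelling frames `[start, end)` as one segment (in the paper this score would come from the RNN features, which are not modelled here). It also assumes that segment scores do not depend on neighbouring labels and that segments are capped at `max_len` frames. Under those assumptions, the log of the sum over every segmentation and labelling can be computed in time proportional to `T * max_len * len(labels)`.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(T, labels, score, max_len):
    """Log-sum of exp(total score) over all segmentations and labellings.

    T        -- number of frames
    labels   -- iterable of candidate segment labels
    score    -- hypothetical callable: score(start, end, label) -> float,
                the score of one segment covering frames [start, end)
    max_len  -- assumed cap on segment length, in frames
    """
    labels = list(labels)
    # alpha[t] = log-sum over all ways to segment and label frames [0, t)
    alpha = [-math.inf] * (T + 1)
    alpha[0] = 0.0  # the empty prefix has exactly one (empty) segmentation
    for end in range(1, T + 1):
        terms = []
        for start in range(max(0, end - max_len), end):
            for y in labels:
                # extend every segmentation of [0, start) with one more segment
                terms.append(alpha[start] + score(start, end, y))
        alpha[end] = logsumexp(terms)
    return alpha[T]


if __name__ == "__main__":
    # Toy score (an assumption for illustration): every segment scores 0,
    # so the result is the log of the number of segmentation/labelling pairs.
    value = log_partition(3, [0, 1], lambda s, e, y: 0.0, max_len=3)
    print(math.exp(value))
```

With all scores set to zero, the exponentiated result simply counts the segmentation and labelling pairs, which makes the recursion easy to check by hand on short inputs.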
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Conditional Random Field
