A Comparison of Adaptation Techniques and Recurrent Neural Network Architectures
Jan Vanek, Josef Michalek, Jan Zelinka, Josef Psutka

TL;DR
This paper compares various neural network architectures and adaptation techniques for acoustic modeling in speech recognition, highlighting the effectiveness of certain methods with RNNs on the TIMIT dataset.
Contribution
It provides a comprehensive comparison of five neural network architectures and multiple adaptation and normalization techniques for RNN-based speech recognition.
Findings
Not all adaptation techniques that are effective for feed-forward NNs also work with RNNs.
Certain adaptation and normalization methods improved RNN performance on TIMIT.
Open-source scripts enable easy replication and further research.
Abstract
Recently, recurrent neural networks (RNNs) have become state-of-the-art in acoustic modeling for automatic speech recognition. Long short-term memory (LSTM) units are the most popular choice. However, alternative units such as the gated recurrent unit (GRU) and its modifications have outperformed LSTM in some publications. In this paper, we compare five neural network (NN) architectures with various adaptation and feature normalization techniques. We evaluate feature-space maximum likelihood linear regression, five variants of i-vector adaptation, and two variants of cepstral mean normalization. Most adaptation and normalization techniques were developed for feed-forward NNs and, according to the results in this paper, not all of them also work with RNNs. For the experiments, we chose the well-known and publicly available TIMIT phone recognition task. Phone recognition is much more sensitive to the…
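To illustrate the cepstral mean normalization mentioned in the abstract, here is a minimal sketch in plain Python. The abstract does not say which two variants the authors evaluated, so the per-utterance and per-speaker variants shown here are an assumption, chosen because they are common in practice. All function names, the `utt2spk` mapping, and the toy feature values are hypothetical and are not taken from the paper or its scripts.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

Frame = List[float]      # one feature vector, e.g. cepstral coefficients
Utterance = List[Frame]  # frames of one utterance in time order


def mean_vector(frames: Sequence[Frame]) -> Frame:
    """Per-dimension mean over a non-empty sequence of equal-length frames."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]


def cmn_per_utterance(utt: Utterance) -> Utterance:
    """Assumed variant 1: subtract the utterance's own mean from every frame."""
    mu = mean_vector(utt)
    return [[x - m for x, m in zip(frame, mu)] for frame in utt]


def cmn_per_speaker(
    utts: Dict[str, Utterance], utt2spk: Dict[str, str]
) -> Dict[str, Utterance]:
    """Assumed variant 2: subtract a mean pooled over all frames of a speaker."""
    pooled: Dict[str, List[Frame]] = defaultdict(list)
    for utt_id, frames in utts.items():
        pooled[utt2spk[utt_id]].extend(frames)
    spk_mean = {spk: mean_vector(frames) for spk, frames in pooled.items()}
    return {
        utt_id: [
            [x - m for x, m in zip(frame, spk_mean[utt2spk[utt_id]])]
            for frame in frames
        ]
        for utt_id, frames in utts.items()
    }


# Toy data (hypothetical): two utterances from the same speaker, 2-dim features.
utts = {"u1": [[1.0, 2.0], [3.0, 4.0]], "u2": [[5.0, 6.0]]}
utt2spk = {"u1": "spkA", "u2": "spkA"}

per_utt = cmn_per_utterance(utts["u1"])
per_spk = cmn_per_speaker(utts, utt2spk)
```

In this toy case the per-utterance variant centers `u1` on its own mean, while the per-speaker variant centers both utterances on the mean pooled over all three of the speaker's frames, so a single utterance is no longer guaranteed to be zero-mean.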
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Attention Model · Sigmoid Activation · Tanh Activation · Long Short-Term Memory
