Exploring Neural Transducers for End-to-End Speech Recognition
Eric Battenberg, Jitong Chen, Rewon Child, Adam Coates, Yashesh Gaur, Yi Li, Hairong Liu, Sanjeev Satheesh, David Seetapun, Anuroop Sriram, Zhenyao Zhu

TL;DR
This paper empirically compares CTC, RNN-Transducer, and attention-based Seq2Seq models for end-to-end speech recognition. On Hub5'00, Seq2Seq and RNN-Transducer models without any language model outperform the best reported CTC models that use one, which simplifies the recognition pipeline.
Contribution
It provides a comprehensive empirical comparison of different end-to-end speech recognition models and analyzes how encoder architecture choices impact their performance.
Findings
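Without any language model, Seq2Seq and RNN-Transducer models outperform the best reported CTC models with a language model on Hub5'00.
On the authors' internal dataset, RNN-Transducer models rescored with a language model after beam search outperform their best CTC models.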
Encoder architecture significantly influences model performance.
Abstract
In this work, we perform an empirical comparison among the CTC, RNN-Transducer, and attention-based Seq2Seq models for end-to-end speech recognition. We show that, without any language model, Seq2Seq and RNN-Transducer models both outperform the best reported CTC models with a language model on the popular Hub5'00 benchmark. On our internal diverse dataset, these trends continue: RNN-Transducer models rescored with a language model after beam search outperform our best CTC models. These results simplify the speech recognition pipeline so that decoding can now be expressed purely as neural network operations. We also study how the choice of encoder architecture affects the performance of the three models, both when all encoder layers are forward-only and when encoders downsample the input representation aggressively.
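The abstract does not say how the encoders downsample their input, so the following is only a generic sketch of one common way to reduce the time resolution of a feature sequence: concatenating groups of adjacent frames. The function name `downsample_by_stacking`, the zero-padding of the final group, and the toy dimensions are assumptions made for illustration, not details taken from the paper.

```python
from typing import List

Frame = List[float]


def downsample_by_stacking(frames: List[Frame], factor: int) -> List[Frame]:
    """Shorten a sequence by concatenating each group of `factor` frames.

    Hypothetical helper: the output has ceil(T / factor) frames, each
    `factor` times wider than an input frame.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    if not frames:
        return []

    dim = len(frames[0])
    padded = list(frames)

    # Assumption: pad with zero frames so the length divides evenly.
    remainder = len(padded) % factor
    if remainder:
        padded += [[0.0] * dim for _ in range(factor - remainder)]

    stacked: List[Frame] = []
    for start in range(0, len(padded), factor):
        merged: Frame = []
        for frame in padded[start:start + factor]:
            merged.extend(frame)  # concatenate along the feature axis
        stacked.append(merged)
    return stacked


# Toy input: 7 frames with 2 features each, downsampled by a factor of 4.
toy_frames = [[float(t), float(t) + 0.5] for t in range(7)]
toy_output = downsample_by_stacking(toy_frames, factor=4)
print(len(toy_output), len(toy_output[0]))
```

With a factor of 4, the 7 toy frames become 2 frames of 8 features each, so every later layer processes a sequence roughly four times shorter. A larger factor trades time resolution for cheaper computation, which is the kind of trade-off the encoder study in the abstract concerns.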
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence
