A Study of All-Convolutional Encoders for Connectionist Temporal Classification
Kalpesh Krishna, Liang Lu, Kevin Gimpel, Karen Livescu

TL;DR
This paper investigates replacing RNNs with deep convolutional neural networks as encoders in CTC-based speech recognition, demonstrating faster training and decoding with comparable accuracy.
Contribution
The paper explores CNN-based encoders for CTC in character-based speech recognition and shows that they are efficient alternatives to RNN encoders with similar performance.
Findings
CNN encoders significantly reduce training and decoding times relative to RNN encoders.
CNN models achieve word error rates comparable to those of LSTM models.
Abstract
Connectionist temporal classification (CTC) is a popular sequence prediction approach for automatic speech recognition that is typically used with models based on recurrent neural networks (RNNs). We explore whether deep convolutional neural networks (CNNs) can be used effectively instead of RNNs as the "encoder" in CTC. CNNs lack an explicit representation of the entire sequence, but have the advantage that they are much faster to train. We present an exploration of CNNs as encoders for CTC models, in the context of character-based (lexicon-free) automatic speech recognition. In particular, we explore a range of one-dimensional convolutional layers, which are particularly efficient. We compare the performance of our CNN-based models against typical RNN-based models in terms of training time, decoding time, model size, and word error rate (WER) on the Switchboard Eval2000 corpus. We find…
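The abstract's remark that CNNs "lack an explicit representation of the entire sequence" can be made concrete by counting how many input frames a single encoder output can see. The following is a minimal sketch, not the authors' code: the function name `receptive_field` and the ten-layer, width-5 stack are hypothetical choices for illustration, and the calculation assumes undilated one-dimensional convolutions.

```python
from typing import Iterable, Tuple


def receptive_field(layers: Iterable[Tuple[int, int]]) -> int:
    """Number of input frames visible to one output frame.

    `layers` is a sequence of (kernel_size, stride) pairs for a stack of
    1-D convolutions without dilation, listed from input to output.
    """
    field = 1  # with no layers, an output frame sees exactly one input frame
    jump = 1   # spacing, in input frames, between adjacent outputs so far
    for kernel_size, stride in layers:
        # Each layer widens the window by (kernel_size - 1) steps of the
        # current spacing, then its stride stretches the spacing.
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


# Hypothetical stack (not taken from the paper): ten layers, width 5, stride 1.
hypothetical_stack = [(5, 1)] * 10
print(receptive_field(hypothetical_stack))  # 41
```

Under these assumptions each output frame depends on a fixed window of 41 input frames, whereas a bidirectional RNN output can in principle depend on the whole utterance. Deeper stacks, wider kernels, or strides enlarge the window, but it stays finite.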
