TL;DR
This paper introduces an improved training pipeline for neural transducers, based on the maximum approximation, and shows that the resulting transducer model outperforms an attention model on Switchboard 300h and generalizes better to longer sequences.
Contribution
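It compares the original training criterion, which marginalizes over all alignments, with the commonly used maximum approximation; generalizes both the neural network model and the output label topology (covering RNN-T, RNA and CTC); and studies the effect of external alignments.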
Findings
The transducer model generalizes much better to longer sequences than the attention model.
The final transducer model outperforms the attention model on Switchboard 300h by over 6% relative WER.
The output label topology is generalized to cover RNN-T, RNA and CTC.
Abstract
The RNN transducer is a promising end-to-end model candidate. We compare the original training criterion with the full marginalization over all alignments, to the commonly used maximum approximation, which simplifies, improves and speeds up our training. We also generalize from the original neural network model and study more powerful models, made possible due to the maximum approximation. We further generalize the output label topology to cover RNN-T, RNA and CTC. We perform several studies among all these aspects, including a study on the effect of external alignments. We find that the transducer model generalizes much better on longer sequences than the attention model. Our final transducer model outperforms our attention model on Switchboard 300h by over 6% relative WER.
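To make the difference between the two training criteria concrete, here is a minimal sketch of the standard RNN-T alignment lattice, scored once by summing over all alignments (full marginalization) and once by keeping only the best single alignment (maximum approximation). It is an illustration under stated assumptions, not the paper's implementation: the function name `rnnt_log_score`, the array layout, and the toy probabilities are all hypothetical, and the abstract does not say how the best alignment is obtained in practice (it mentions a study of external alignments).

```python
import numpy as np

def rnnt_log_score(log_blank, log_label, use_max=False):
    """Score a label sequence on an RNN-T lattice (hypothetical helper).

    log_blank: array [T, U+1], log prob of emitting blank at node (t, u)
    log_label: array [T, U],   log prob of emitting label u+1 at node (t, u)
    use_max:   False -> log of the sum over all alignments
               True  -> log prob of the single best alignment
    """
    T, U_plus_1 = log_blank.shape
    U = U_plus_1 - 1
    # The only difference between the two criteria is how paths are merged.
    combine = np.maximum if use_max else np.logaddexp

    alpha = np.full((T, U + 1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Reach (t, u) by a blank from (t-1, u) or a label from (t, u-1).
            from_blank = alpha[t - 1, u] + log_blank[t - 1, u] if t > 0 else -np.inf
            from_label = alpha[t, u - 1] + log_label[t, u - 1] if u > 0 else -np.inf
            alpha[t, u] = combine(from_blank, from_label)
    # A final blank at the last node terminates every alignment.
    return alpha[T - 1, U] + log_blank[T - 1, U]

# Toy example (assumed values): T=2 frames, U=1 label, every emission has prob 0.5.
log_blank = np.log(np.full((2, 2), 0.5))
log_label = np.log(np.full((2, 1), 0.5))
full_sum = rnnt_log_score(log_blank, log_label, use_max=False)
best_path = rnnt_log_score(log_blank, log_label, use_max=True)
```

In this toy lattice there are two alignments, each with three emissions, so the full sum is 2 × 0.125 = 0.25 while the best single path has probability 0.125. The maximum approximation is therefore always a lower bound on the full-sum score.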
