Memory-Efficient Training of RNN-Transducer with Sampled Softmax
Jaesong Lee, Lukas Lee, Shinji Watanabe

TL;DR
This paper introduces a memory-efficient training method for RNN-Transducer speech recognition models by applying sampled softmax, reducing memory use while maintaining accuracy across multiple datasets.
Contribution
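The authors adapt sampled softmax to RNN-Transducer so that only a small subset of the vocabulary is needed during training, extend it to optimize memory consumption across a minibatch, and sample vocabulary from the distributions of auxiliary CTC losses to improve accuracy.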
Findings
Significant memory reduction during training.
Maintained baseline accuracy on LibriSpeech, AISHELL-1, and CSJ-APS.
Effective memory optimization for end-to-end speech recognition.
Abstract
RNN-Transducer has been one of the promising architectures for end-to-end automatic speech recognition. Although RNN-Transducer has many advantages, including strong accuracy and a streaming-friendly property, its high memory consumption during training has been a critical problem for development. In this work, we propose to apply sampled softmax to RNN-Transducer, which requires only a small subset of the vocabulary during training and thus reduces memory consumption. We further extend sampled softmax to optimize memory consumption for a minibatch, and employ the distributions of auxiliary CTC losses for sampling vocabulary to improve model accuracy. We present experimental results on LibriSpeech, AISHELL-1, and CSJ-APS, where sampled softmax greatly reduces memory consumption while still maintaining the accuracy of the baseline model.
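The idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the additive tanh joint network, the way the subset is shared across the minibatch, and the use of a single averaged CTC-style distribution for sampling are all assumptions made for illustration, and the RNN-T loss itself and any correction for sampling probabilities are omitted. The point it shows is that the joint output shrinks from (B, T, U+1, V) to (B, T, U+1, V') when only a sampled vocabulary subset of size V' is projected.

```python
import numpy as np

def sample_vocab_subset(targets, sampling_probs, num_sampled, blank_id=0, rng=None):
    """Hypothetical helper: choose the vocabulary ids used for one minibatch.

    The subset always contains the blank id and every label that occurs in
    `targets`; up to `num_sampled` further ids are drawn without replacement
    from `sampling_probs` (assumed here to come from an auxiliary CTC output).
    """
    rng = np.random.default_rng() if rng is None else rng
    required = np.union1d(np.asarray([blank_id]), np.unique(targets))
    probs = np.array(sampling_probs, dtype=np.float64)
    probs[required] = 0.0  # extra samples should all be new entries
    n_extra = min(num_sampled, int(np.count_nonzero(probs)))
    if n_extra > 0:
        probs /= probs.sum()
        extra = rng.choice(len(probs), size=n_extra, replace=False, p=probs)
    else:
        extra = np.array([], dtype=np.int64)
    return np.sort(np.concatenate([required, extra])).astype(np.int64)

def sampled_joint_log_probs(enc, pred, W_out, b_out, subset):
    """Log-softmax over the sampled subset only.

    enc: (B, T, H), pred: (B, U+1, H), W_out: (V, H), b_out: (V,).
    Returns an array of shape (B, T, U+1, V') with V' = len(subset).
    """
    joint = np.tanh(enc[:, :, None, :] + pred[:, None, :, :])  # (B, T, U+1, H)
    logits = joint @ W_out[subset].T + b_out[subset]           # (B, T, U+1, V')
    logits = logits - logits.max(axis=-1, keepdims=True)       # numerical stability
    return logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

# Toy usage with random data (all sizes are arbitrary).
rng = np.random.default_rng(0)
B, T, U, H, V = 2, 5, 3, 8, 1000
targets = rng.integers(1, V, size=(B, U))
ctc_probs = rng.dirichlet(np.ones(V))  # stand-in for an averaged auxiliary CTC posterior
subset = sample_vocab_subset(targets, ctc_probs, num_sampled=16, rng=rng)

enc = rng.standard_normal((B, T, H))
pred = rng.standard_normal((B, U + 1, H))
W_out = rng.standard_normal((V, H))
b_out = np.zeros(V)

log_probs = sampled_joint_log_probs(enc, pred, W_out, b_out, subset)
# Targets must be re-indexed into positions within the sorted subset.
local_targets = np.searchsorted(subset, targets)
print(log_probs.shape, "instead of", (B, T, U + 1, V))
```

Because the subset is sorted and contains every target label, `np.searchsorted` gives the position of each label inside the reduced output layer, which is what a transducer loss computed over the subset would need.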
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing
Methods: Softmax
