Memory-Efficient Training of RNN-Transducer with Sampled Softmax

Jaesong Lee; Lukas Lee; Shinji Watanabe

arXiv:2203.16868 · eess.AS · April 1, 2022

TL;DR

This paper introduces a memory-efficient training method for RNN-Transducer speech recognition models by applying sampled softmax, reducing memory use while maintaining accuracy across multiple datasets.

Contribution

The authors adapt sampled softmax for RNN-Transducer, extending it to optimize memory for minibatches and using auxiliary CTC distributions to enhance accuracy.

Findings

1. Significant memory reduction during training.
2. Maintained baseline accuracy on LibriSpeech, AISHELL-1, and CSJ-APS.
3. Effective memory optimization for end-to-end speech recognition.

Abstract

RNN-Transducer has been one of the promising architectures for end-to-end automatic speech recognition. Although RNN-Transducer has many advantages, including strong accuracy and a streaming-friendly design, its high memory consumption during training has been a critical problem for development. In this work, we propose applying sampled softmax to RNN-Transducer, which requires only a small subset of the vocabulary during training and thus reduces memory consumption. We further extend sampled softmax to optimize memory consumption for a minibatch, and employ the distributions of auxiliary CTC losses to sample the vocabulary, which improves model accuracy. We present experimental results on LibriSpeech, AISHELL-1, and CSJ-APS, where sampled softmax greatly reduces memory consumption while maintaining the accuracy of the baseline model.
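To illustrate why restricting the vocabulary saves memory, here is a minimal NumPy sketch, not the authors' implementation. It assumes a simple additive joint network whose output has shape (T, U, V) for T encoder frames, U decoder positions, and V vocabulary entries, and it compares that with the (T, U, V') tensor obtained when only a sampled subset of V' tokens is scored. The function names `sampled_vocab` and `joint_log_probs`, the tensor sizes, and the uniform sampling of extra tokens are all assumptions made for illustration; the paper instead draws on auxiliary CTC distributions for sampling, which this sketch does not model.

```python
import numpy as np

def sampled_vocab(targets, vocab_size, num_sampled, blank=0, rng=None):
    """Hypothetical helper: blank + target tokens + uniformly sampled extras."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Tokens that must always be scored: the blank symbol and every target token.
    required = np.union1d([blank], targets)
    # Draw additional tokens (without replacement) from the rest of the vocabulary.
    candidates = np.setdiff1d(np.arange(vocab_size), required)
    extra = rng.choice(candidates, size=min(num_sampled, candidates.size), replace=False)
    return np.union1d(required, extra)

def joint_log_probs(enc, dec, W, b, vocab_ids):
    """Log-softmax restricted to vocab_ids.

    enc: (T, H) encoder outputs, dec: (U, H) prediction-network outputs,
    W: (V, H) output projection, b: (V,) output bias.
    """
    hidden = np.tanh(enc[:, None, :] + dec[None, :, :])   # (T, U, H)
    logits = hidden @ W[vocab_ids].T + b[vocab_ids]       # (T, U, V')
    logits -= logits.max(axis=-1, keepdims=True)          # numerical stability
    return logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

# Illustrative sizes only; they are not taken from the paper.
T, H, V = 50, 16, 5000
targets = np.array([17, 42, 42, 4096])
U = len(targets) + 1

rng = np.random.default_rng(0)
enc = rng.standard_normal((T, H))
dec = rng.standard_normal((U, H))
W = rng.standard_normal((V, H)) * 0.1
b = np.zeros(V)

ids = sampled_vocab(targets, V, num_sampled=64, rng=rng)
full = joint_log_probs(enc, dec, W, b, np.arange(V))   # shape (50, 5, 5000)
small = joint_log_probs(enc, dec, W, b, ids)           # shape (50, 5, 68)
print(full.shape, small.shape, full.size // small.size)
```

In this sketch the restricted tensor holds 68 scores per (frame, position) pair instead of 5000, so the largest activation in the loss computation shrinks accordingly. Note that the restricted log-probabilities are normalized over the subset only, so they approximate rather than equal the full-softmax values.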

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing

Methods: Softmax