Neural Transducer Training: Reduced Memory Consumption with Sample-wise Computation
Stefan Braun, Erik McDermott, Roger Hsiao

TL;DR
This paper introduces a memory-efficient training method for neural transducers in speech recognition, enabling larger batch sizes and longer sequences on limited GPU memory by computing loss and gradients sample-wise.
Contribution
The paper proposes a novel sample-wise computation approach for neural transducer training that reduces memory usage and maintains competitive speed.
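The sample-wise idea can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the additive joint (`enc[t] + pred[u]`), the blank index 0, and the function names `sample_loss` and `samplewise_batch_loss` are assumptions made for illustration. It computes the standard transducer negative log-likelihood with the forward algorithm, building the joint lattice for one sample at a time at that sample's true lengths, so no padded batch-wide lattice is ever held in memory.

```python
import math

BLANK = 0  # assumed blank index (hypothetical choice for this sketch)


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def logaddexp(a, b):
    """log(exp(a) + exp(b)) that tolerates -inf inputs."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def sample_loss(enc, pred, labels):
    """Transducer negative log-likelihood for ONE sample.

    enc:    T x V encoder outputs (list of lists)
    pred:   (U+1) x V predictor outputs
    labels: U target label indices
    """
    T, U = len(enc), len(labels)
    # Joint lattice for this sample only: T x (U+1) x V log-probabilities.
    # Toy additive joint, assumed here in place of a real joint network.
    lp = [[log_softmax([e + p for e, p in zip(enc[t], pred[u])])
           for u in range(U + 1)] for t in range(T)]
    # Forward variables over the T x (U+1) lattice.
    alpha = [[-math.inf] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # arrive by emitting blank at (t-1, u)
                alpha[t][u] = logaddexp(
                    alpha[t][u], alpha[t - 1][u] + lp[t - 1][u][BLANK])
            if u > 0:  # arrive by emitting label u-1 at (t, u-1)
                alpha[t][u] = logaddexp(
                    alpha[t][u], alpha[t][u - 1] + lp[t][u - 1][labels[u - 1]])
    # A final blank terminates the alignment.
    return -(alpha[T - 1][U] + lp[T - 1][U][BLANK])


def samplewise_batch_loss(batch):
    """Mean loss over a batch, visiting one sample at a time.

    Each sample's lattice goes out of scope before the next one is built,
    so peak lattice memory follows the largest sample, not the whole batch.
    """
    total = 0.0
    for enc, pred, labels in batch:
        total += sample_loss(enc, pred, labels)
    return total / len(batch)
```

In an actual training setup the per-sample gradients would also be accumulated before the lattice is released; that step is omitted here because it depends on the autodiff framework in use.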
Findings
Significantly reduces memory consumption during training.
Computes the transducer loss and gradients for a batch size of 1024 and 40-second audio sequences using only 6 GB of memory.
Maintains competitive training speed compared to traditional batched methods.
Abstract
The neural transducer is an end-to-end model for automatic speech recognition (ASR). While the model is well-suited for streaming ASR, the training process remains challenging. During training, the memory requirements may quickly exceed the capacity of state-of-the-art GPUs, limiting batch size and sequence lengths. In this work, we analyze the time and space complexity of a typical transducer training setup. We propose a memory-efficient training method that computes the transducer loss and gradients sample by sample. We present optimizations to increase the efficiency and parallelism of the sample-wise method. In a set of thorough benchmarks, we show that our sample-wise method significantly reduces memory usage, and performs at competitive speed when compared to the default batched computation. As a highlight, we manage to compute the transducer loss and gradients for a batch size of…
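To see why memory can run out so quickly, it helps to estimate the size of the joint network output, which in a standard transducer has shape batch × frames × (labels + 1) × vocabulary. The sketch below is a back-of-the-envelope calculation only. The frame count (1000), label count (100), and vocabulary size (1024) are hypothetical values chosen for illustration, not figures from the paper, and the helper names are made up for this example.

```python
def joint_tensor_bytes(frames, labels, vocab, batch=1, bytes_per_value=4):
    """Bytes needed for a joint output of shape batch x frames x (labels+1) x vocab."""
    return batch * frames * (labels + 1) * vocab * bytes_per_value


def batched_peak_bytes(lengths, vocab, bytes_per_value=4):
    """Batched computation: every sample is padded to the longest T and U."""
    max_t = max(t for t, _ in lengths)
    max_u = max(u for _, u in lengths)
    return joint_tensor_bytes(max_t, max_u, vocab,
                              batch=len(lengths),
                              bytes_per_value=bytes_per_value)


def samplewise_peak_bytes(lengths, vocab, bytes_per_value=4):
    """Sample-wise computation: only one sample's lattice is alive at a time."""
    return max(joint_tensor_bytes(t, u, vocab, bytes_per_value=bytes_per_value)
               for t, u in lengths)


# Hypothetical batch: 1024 samples, each with 1000 encoder frames
# and 100 labels, with an assumed vocabulary of 1024 units.
lengths = [(1000, 100)] * 1024
VOCAB = 1024

batched = batched_peak_bytes(lengths, VOCAB)
samplewise = samplewise_peak_bytes(lengths, VOCAB)
print(f"batched: {batched / 2**30:.1f} GiB, "
      f"sample-wise: {samplewise / 2**30:.2f} GiB")
```

This counts only the joint output tensor in 32-bit floats and ignores encoder and predictor activations, so it is an illustration of the scaling rather than a prediction of real memory use. A practical implementation may also process several samples in parallel, which trades some of the saving for speed.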
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: SPEED (Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings)
