Dynamic Encoder Transducer: A Flexible Solution For Trading Off Accuracy For Latency
Yangyang Shi, Varun Nagaraja, Chunyang Wu, Jay Mahadeokar, Duc Le,, Rohit Prabhavalkar, Alex Xiao, Ching-Feng Yeh, Julian Chan, Christian Fuegen,, Ozlem Kalinli, Michael L. Seltzer

TL;DR
The paper introduces a dynamic encoder transducer (DET) for on-device speech recognition that adaptively balances accuracy and latency by assigning different encoder depths to different parts of an utterance, improving performance across devices.
Contribution
It presents a novel DET framework that enables flexible accuracy-latency trade-offs without retraining, using layer dropout and collaborative learning techniques.
Findings
DET reduces WER by over 8% on Librispeech.
Lightweight encoder with collaborative learning cuts model size by 25%.
DET achieves similar accuracy with better latency on in-house data.
Abstract
We propose a dynamic encoder transducer (DET) for on-device speech recognition. One DET model scales to multiple devices with different computation capacities without retraining or finetuning. To trading off accuracy and latency, DET assigns different encoders to decode different parts of an utterance. We apply and compare the layer dropout and the collaborative learning for DET training. The layer dropout method that randomly drops out encoder layers in the training phase, can do on-demand layer dropout in decoding. Collaborative learning jointly trains multiple encoders with different depths in one single model. Experiment results on Librispeech and in-house data show that DET provides a flexible accuracy and latency trade-off. Results on Librispeech show that the full-size encoder in DET relatively reduces the word error rate of the same size baseline by over 8%. The lightweight…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
MethodsDropout
