Dynamic Encoder Transducer: A Flexible Solution For Trading Off Accuracy For Latency

Yangyang Shi; Varun Nagaraja; Chunyang Wu; Jay Mahadeokar; Duc Le; Rohit Prabhavalkar; Alex Xiao; Ching-Feng Yeh; Julian Chan; Christian Fuegen; Ozlem Kalinli; Michael L. Seltzer

arXiv:2104.02176·cs.CL·April 7, 2021

Open Access

TL;DR

The paper introduces a dynamic encoder transducer (DET) for on-device speech recognition that adaptively balances accuracy and latency by assigning different encoder depths to different parts of an utterance, improving performance across devices.
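The idea of assigning different encoders to different parts of an utterance can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `make_toy_encoder` and `encode_with_schedule`, the frame-index schedule, and the toy "encoders" that only add their depth to each feature. A real streaming encoder would also carry context across segment boundaries, which this sketch ignores.

```python
from typing import Callable, List, Sequence, Tuple

Frame = List[float]
Encoder = Callable[[Sequence[Frame]], List[Frame]]


def make_toy_encoder(depth: int) -> Encoder:
    """Hypothetical stand-in for an encoder with `depth` layers.

    Each toy layer adds 1.0 to every feature, so the output reveals
    which encoder processed a given frame.
    """
    def encode(frames: Sequence[Frame]) -> List[Frame]:
        return [[x + depth for x in frame] for frame in frames]
    return encode


def encode_with_schedule(
    frames: Sequence[Frame],
    schedule: Sequence[Tuple[int, Encoder]],
) -> List[Frame]:
    """Encode contiguous segments of an utterance with different encoders.

    `schedule` is a list of (end_frame_index, encoder) pairs; each encoder
    handles the frames from the previous end index up to its own.
    """
    outputs: List[Frame] = []
    start = 0
    for end, encoder in schedule:
        if end < start:
            raise ValueError("schedule end indices must be non-decreasing")
        outputs.extend(encoder(frames[start:end]))
        start = end
    if start < len(frames):
        raise ValueError("schedule does not cover the whole utterance")
    return outputs


# Usage: a deeper encoder for the first four frames, a shallower one for the rest.
utterance = [[0.0] for _ in range(6)]
deep, shallow = make_toy_encoder(4), make_toy_encoder(2)
encoded = encode_with_schedule(utterance, [(4, deep), (6, shallow)])
```

Which part of the utterance gets the deeper encoder is a choice made here only for the example; the summary above does not specify it.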

Contribution

It presents a novel DET framework that enables flexible accuracy-latency trade-offs without retraining, using layer dropout and collaborative learning techniques.

Findings

01 DET's full-size encoder reduces WER by over 8% relative to a same-size baseline on Librispeech.

02 The lightweight encoder trained with collaborative learning cuts model size by 25%.

03 DET achieves similar accuracy with better latency on in-house data.

Abstract

We propose a dynamic encoder transducer (DET) for on-device speech recognition. One DET model scales to multiple devices with different computation capacities without retraining or finetuning. To trade off accuracy and latency, DET assigns different encoders to decode different parts of an utterance. We apply and compare layer dropout and collaborative learning for DET training. The layer dropout method, which randomly drops out encoder layers in the training phase, can perform on-demand layer dropout in decoding. Collaborative learning jointly trains multiple encoders with different depths in one single model. Experimental results on Librispeech and in-house data show that DET provides a flexible accuracy and latency trade-off. Results on Librispeech show that the full-size encoder in DET reduces the word error rate of the same-size baseline by over 8% relative. The lightweight…
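The layer dropout mechanism described in the abstract can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the function `run_encoder`, its parameters (`drop_prob`, `skip`, `training`), and the scalar "layers" are all assumptions. A dropped layer is treated as an identity mapping, which is the usual reading when layers sit behind residual connections.

```python
import random
from typing import Callable, List, Optional, Set

Layer = Callable[[float], float]


def run_encoder(
    x: float,
    layers: List[Layer],
    drop_prob: float = 0.0,
    skip: Optional[Set[int]] = None,
    training: bool = False,
    rng: Optional[random.Random] = None,
) -> float:
    """Apply a stack of layers with layer dropout (hypothetical helper).

    Training: each layer is skipped independently with probability `drop_prob`.
    Decoding: the layers whose indices are in `skip` are dropped on demand.
    """
    rng = rng or random.Random()
    skip = skip or set()
    for index, layer in enumerate(layers):
        if training and rng.random() < drop_prob:
            continue  # randomly dropped during training
        if not training and index in skip:
            continue  # dropped on demand at decoding time
        x = layer(x)
    return x


# Usage: four toy layers that each add 1.0 to the input.
toy_layers: List[Layer] = [lambda v: v + 1.0 for _ in range(4)]
full_depth = run_encoder(0.0, toy_layers)               # all four layers run
half_depth = run_encoder(0.0, toy_layers, skip={1, 3})  # two layers run
```

Because the model has seen randomly thinned layer stacks during training, the same weights can be run at several depths at decoding time, which is what allows one model to serve devices with different compute budgets.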

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Dropout