Flexi-Transducer: Optimizing Latency, Accuracy and Compute for Multi-Domain On-Device Scenarios

Jay Mahadeokar; Yangyang Shi; Yuan Shangguan; Chunyang Wu; Alex Xiao; Hang Su; Duc Le; Ozlem Kalinli; Christian Fuegen; Michael L. Seltzer

arXiv:2104.02232·cs.SD·April 7, 2021·1 cites

Open Access

TL;DR

Flexi-Transducer is a unified on-device speech recognition model that adaptively balances latency and accuracy across multiple domains using domain-specific techniques and a domain indicator, optimizing performance for diverse use-cases.

Contribution

The paper introduces Flexi-Transducer, a novel single model that employs domain-specific segment size adjustments, a restricted RNNT loss, and a domain indicator vector to optimize latency and accuracy for multi-domain on-device ASR.

Findings

01 Improved WER for dictation scenarios.

02 Reduced real-time factor for voice commands.

03 Achieved flexible latency-accuracy trade-offs.

Abstract

Often, the storage and computational constraints of embedded devices demand that a single on-device ASR model serve multiple use-cases / domains. In this paper, we propose a Flexible Transducer (FlexiT) for on-device automatic speech recognition to flexibly deal with multiple use-cases / domains with different accuracy and latency requirements. Specifically, using a single compact model, FlexiT provides a fast response for voice commands, and accurate transcription but with more latency for dictation. In order to achieve flexible and better accuracy and latency trade-offs, the following techniques are used. Firstly, we propose using domain-specific altering of segment size for the Emformer encoder, which enables FlexiT to achieve flexible decoding. Secondly, we use Alignment Restricted RNNT loss to achieve flexible fine-grained control on token emission latency for different domains. Finally, we…
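
As a rough illustration of the first technique, the sketch below splits a stream of feature frames into segments whose size depends on the domain, and tags every frame with a one-hot domain indicator of the kind mentioned in the contribution summary. It is not the authors' implementation: the segment sizes, the domain names, the helper names `domain_one_hot` and `segment_with_domain`, and the choice to append the indicator to each input frame are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical per-domain segment sizes in frames (illustrative values only,
# not taken from the paper): small segments for fast voice-command responses,
# larger segments for more accurate dictation.
SEGMENT_FRAMES: Dict[str, int] = {"voice_command": 4, "dictation": 16}

# Fixed domain order so the one-hot indicator is stable across calls.
DOMAINS: List[str] = sorted(SEGMENT_FRAMES)


def domain_one_hot(domain: str) -> List[float]:
    """Return a one-hot domain indicator vector (assumed encoding)."""
    return [1.0 if d == domain else 0.0 for d in DOMAINS]


def segment_with_domain(
    frames: List[List[float]], domain: str
) -> List[List[List[float]]]:
    """Append the domain indicator to each frame, then chunk into segments.

    The last segment may be shorter than the configured size.
    """
    size = SEGMENT_FRAMES[domain]
    indicator = domain_one_hot(domain)
    tagged = [frame + indicator for frame in frames]
    return [tagged[i:i + size] for i in range(0, len(tagged), size)]


if __name__ == "__main__":
    # 20 dummy frames with 3 features each.
    frames = [[0.0, 0.0, 0.0] for _ in range(20)]
    for name in DOMAINS:
        segments = segment_with_domain(frames, name)
        print(name, [len(s) for s in segments])
```

With the assumed sizes, the same 20 frames become five short segments for voice commands, so the encoder can emit output sooner, and two longer segments for dictation, so the encoder sees more context per step at the cost of latency.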

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems