A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes

Shaojin Ding; Weiran Wang; Ding Zhao; Tara N. Sainath; Yanzhang He; Robert David; Rami Botros; Xin Wang; Rina Panigrahy; Qiao Liang; Dongseong Hwang; Ian McGraw; Rohit Prabhavalkar; Trevor Strohman

arXiv:2204.06164 · eess.AS · June 28, 2022

TL;DR

This paper introduces a dynamic cascaded encoder ASR model that unifies multiple model sizes in a single model. It reduces model size and power consumption while maintaining quality, and it simplifies deployment across scenarios.

Contribution

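The paper presents a unified cascaded encoder architecture together with three techniques for getting the most out of each model size: separate decoders per sub-model over shared encoders, funnel pooling for encoder efficiency, and balancing the sizes of the causal and non-causal encoders. Because one model serves several deployment sizes, engineering complexity is reduced.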

Findings

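01. The large-medium model is 30% smaller and consumes 33% less power than the baseline cascaded encoder model.

02. The triple-size model, which unifies the large, medium, and small models, achieves a 37% total size reduction.

03. Quality loss is minimal across the different model sizes.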

Abstract

In this paper, we propose a dynamic cascaded encoder Automatic Speech Recognition (ASR) model, which unifies models for different deployment scenarios. Moreover, the model can significantly reduce model size and power consumption without loss of quality. Namely, with the dynamic cascaded encoder model, we explore three techniques to maximally boost the performance of each model size: 1) Use separate decoders for each sub-model while sharing the encoders; 2) Use funnel-pooling to improve the encoder efficiency; 3) Balance the size of causal and non-causal encoders to improve quality and fit deployment constraints. Overall, the proposed large-medium model has 30% smaller size and reduces power consumption by 33%, compared to the baseline cascaded encoder model. The triple-size model that unifies the large, medium, and small models achieves 37% total size reduction with minimal quality…
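The routing idea in the abstract can be illustrated with a minimal, hypothetical sketch. It is not the authors' implementation: the encoders and decoders below are toy stand-ins, the names (`funnel_pool`, `causal_encoder`, `non_causal_encoder`, `recognize`) are invented for illustration, funnel pooling is simplified to average pooling over time, and where the pooling sits and which encoders each sub-model uses are assumptions. The only parts taken from the abstract are that sub-models share encoders, each sub-model has its own decoder, and pooling shortens the sequence the later encoder must process.

```python
from typing import Callable, Dict, List

Frame = List[float]
Sequence = List[Frame]


def funnel_pool(frames: Sequence, stride: int = 2) -> Sequence:
    """Average every `stride` consecutive frames, shortening the sequence.

    Simplification: real funnel pooling operates inside self-attention.
    """
    pooled: Sequence = []
    for start in range(0, len(frames), stride):
        group = frames[start:start + stride]
        dim = len(group[0])
        pooled.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return pooled


def causal_encoder(frames: Sequence) -> Sequence:
    """Toy causal encoder: running mean over past and current frames only."""
    out: Sequence = []
    running = [0.0] * len(frames[0])
    for t, frame in enumerate(frames, start=1):
        running = [r + x for r, x in zip(running, frame)]
        out.append([r / t for r in running])
    return out


def non_causal_encoder(frames: Sequence) -> Sequence:
    """Toy non-causal encoder: every frame sees the whole-sequence mean."""
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[x + m for x, m in zip(f, mean)] for f in frames]


def make_decoder(name: str) -> Callable[[Sequence], str]:
    """Toy decoder that only reports which sub-model ran and on how many frames."""
    def decode(encoded: Sequence) -> str:
        return f"{name}:{len(encoded)} frames"
    return decode


# One decoder per sub-model; the encoders are shared (assumed two sizes here).
DECODERS: Dict[str, Callable[[Sequence], str]] = {
    "small": make_decoder("small"),
    "large": make_decoder("large"),
}


def recognize(frames: Sequence, size: str) -> str:
    """Route the input through the encoders that the requested size uses."""
    shared = causal_encoder(frames)            # every size runs the causal encoder
    if size == "small":
        return DECODERS["small"](shared)       # exit early with the small decoder
    pooled = funnel_pool(shared, stride=2)     # assumed placement of the pooling
    return DECODERS["large"](non_causal_encoder(pooled))
```

Calling `recognize(frames, "small")` stops after the shared causal encoder, while `recognize(frames, "large")` continues through pooling and the non-causal encoder, so one set of shared weights serves both deployment sizes.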

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music Technology and Sound Studies