Learning a Dual-Mode Speech Recognition Model via Self-Pruning

Chunxi Liu; Yuan Shangguan; Haichuan Yang; Yangyang Shi; Raghuraman Krishnamoorthi; Ozlem Kalinli

arXiv:2207.11906 · eess.AS · October 10, 2022

TL;DR

This paper proposes a unified supernet training approach to jointly learn a compact streaming ASR model and a large non-streaming model, improving performance for both use cases through self-supervised and supervised training.

Contribution

It introduces a novel supernet training method that jointly optimizes sparse streaming and dense non-streaming ASR models, enhancing their performance simultaneously.

Findings

1. Supernet training improves both streaming and non-streaming models.

2. Self-supervised and supervised training synergistically enhance model quality.

3. The approach simplifies deployment for diverse ASR applications.

Abstract

There is growing interest in unifying streaming and full-context automatic speech recognition (ASR) networks into a single end-to-end ASR model, which simplifies model training and deployment for both use cases. In real-world ASR applications, however, streaming ASR models typically operate under tighter storage and computational constraints (e.g., on embedded devices) than server-side full-context models. Motivated by recent progress in Omni-sparsity supernet training, where multiple subnetworks are jointly optimized in a single model, this work aims to jointly learn a compact sparse on-device streaming ASR model and a large dense server non-streaming model in a single supernet. We then show that performing supernet training on both wav2vec 2.0 self-supervised learning and supervised ASR fine-tuning can not only substantially improve the large non-streaming model as…
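The weight sharing described in the abstract, a sparse subnetwork and a dense network living in one set of parameters, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the magnitude-based pruning criterion, the single linear layer, and the names `magnitude_mask` and `forward` are hypothetical and are not taken from the paper, which may select and train its subnetworks differently. Streaming versus full-context attention is not modeled at all.

```python
import numpy as np


def magnitude_mask(weight, sparsity):
    """Return a 0/1 mask that zeroes the `sparsity` fraction of
    smallest-magnitude entries in `weight` (illustrative criterion only)."""
    k = int(round(sparsity * weight.size))
    if k == 0:
        return np.ones_like(weight)
    flat = np.abs(weight).ravel()
    # Indices of the k smallest magnitudes; a stable sort keeps ties deterministic.
    drop = np.argsort(flat, kind="stable")[:k]
    mask = np.ones(weight.size, dtype=weight.dtype)
    mask[drop] = 0.0
    return mask.reshape(weight.shape)


def forward(x, weight, mask=None):
    """Linear layer: dense mode uses all weights, sparse mode applies the mask."""
    w = weight if mask is None else weight * mask
    return x @ w.T


rng = np.random.default_rng(0)
weight = rng.standard_normal((4, 8))   # shared supernet parameters (toy size)
x = rng.standard_normal((2, 8))        # toy batch of input features

mask = magnitude_mask(weight, sparsity=0.5)
dense_out = forward(x, weight)         # stands in for the large dense model
sparse_out = forward(x, weight, mask)  # stands in for the compact sparse model
```

Both outputs are computed from the same `weight` array, so in a joint training setup a loss on `dense_out` and a loss on `sparse_out` would both update the same parameters; the sparse model is simply the dense one viewed through `mask`.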

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing