Knowledge Distillation from Non-streaming to Streaming ASR Encoder using Auxiliary Non-streaming Layer

Kyuhong Shim; Jinkyu Lee; Simyung Chang; Kyuwoong Hwang

arXiv:2308.16415 · cs.CL · September 1, 2023

TL;DR

This paper introduces a layer-to-layer knowledge distillation method for streaming ASR models, using auxiliary non-streaming layers and autoregressive predictive coding to improve performance over previous token probability distillation approaches.

Contribution

It proposes a novel layer-to-layer KD approach with auxiliary non-streaming layers and APC-based loss, enhancing streaming ASR accuracy beyond existing methods.

Findings

1. Significant WER reduction compared to previous distillation methods
2. Effective use of auxiliary non-streaming layers for feature alignment
3. Improved prediction of unseen future contexts in streaming ASR

Abstract

Streaming automatic speech recognition (ASR) models are restricted from accessing future context, which results in worse performance compared to the non-streaming models. To improve the performance of streaming ASR, knowledge distillation (KD) from the non-streaming to streaming model has been studied, mainly focusing on aligning the output token probabilities. In this paper, we propose a layer-to-layer KD from the teacher encoder to the student encoder. To ensure that features are extracted using the same context, we insert auxiliary non-streaming branches to the student and perform KD from the non-streaming teacher layer to the non-streaming auxiliary layer. We design a special KD loss that leverages the autoregressive predictive coding (APC) mechanism to encourage the streaming model to predict unseen future contexts. Experimental results show that the proposed method can…
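The two loss ingredients named in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss formula, so everything concrete here is an assumption made for illustration: the mean-squared-error distance, the `shift` parameter, the `weight` factor, and all function names are hypothetical and are not taken from the paper. Features are plain nested lists of shape [time][feature_dim].

```python
from typing import List

Frames = List[List[float]]  # [time][feature_dim]


def mse(a: Frames, b: Frames) -> float:
    """Mean squared error over all paired frames and dimensions."""
    total, count = 0.0, 0
    for frame_a, frame_b in zip(a, b):
        for x, y in zip(frame_a, frame_b):
            total += (x - y) ** 2
            count += 1
    return total / count if count else 0.0


def layer_kd_loss(teacher: Frames, student_aux: Frames) -> float:
    """Same-frame distance between a non-streaming teacher layer and the
    student's auxiliary non-streaming layer (assumed here to be MSE)."""
    if len(teacher) != len(student_aux):
        raise ValueError("teacher and student must have the same number of frames")
    return mse(teacher, student_aux)


def future_prediction_loss(teacher: Frames, student_pred: Frames, shift: int) -> float:
    """APC-style term (assumed form): the student's output at frame t is
    compared with the teacher feature at frame t + shift."""
    if shift <= 0:
        raise ValueError("shift must be positive")
    if len(teacher) != len(student_pred):
        raise ValueError("teacher and student must have the same number of frames")
    if shift >= len(teacher):
        return 0.0  # no frame pairs remain after shifting
    return mse(teacher[shift:], student_pred[: len(student_pred) - shift])


def total_kd_loss(teacher: Frames, student_aux: Frames, student_pred: Frames,
                  shift: int = 1, weight: float = 1.0) -> float:
    """Hypothetical combination: same-frame term plus weighted future term."""
    return layer_kd_loss(teacher, student_aux) + weight * future_prediction_loss(
        teacher, student_pred, shift
    )


if __name__ == "__main__":
    teacher = [[0.0], [1.0], [2.0], [3.0]]
    # A student whose frame t already equals the teacher's frame t + 1.
    student_pred = [[1.0], [2.0], [3.0], [0.0]]
    print(total_kd_loss(teacher, teacher, student_pred, shift=1))
```

In this toy setup the future term reaches zero only when the student's output at each frame matches the teacher's feature one frame ahead, which is the sense in which an APC-style objective pushes a streaming model to anticipate context it cannot see.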

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Knowledge Distillation