Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation

Muhammad Shakeel; Yui Sudo; Yifan Peng; Shinji Watanabe

arXiv:2405.13514 · eess.AS · September 12, 2024


TL;DR

This paper introduces a joint optimization framework for streaming and non-streaming speech recognition that uses a multi-decoder architecture and knowledge distillation. A single model achieves relative character error rate reductions of 2.6%-5.3% for streaming ASR and 8.3%-9.7% for non-streaming ASR.

Contribution

The paper proposes a multi-decoder and knowledge distillation approach that unifies streaming and non-streaming ASR in one model, so that the mode can be switched flexibly while recognition accuracy improves in both.

Findings

01

2.6%-5.3% relative character error rate reduction (CERR) on CSJ for streaming ASR

02

8.3%-9.7% relative character error rate reduction (CERR) for non-streaming ASR

03

Single model achieves competitive performance for both modes

Abstract

End-to-end (E2E) automatic speech recognition (ASR) can operate in two modes: streaming and non-streaming, each with its pros and cons. Streaming ASR processes speech frames in real time as they are received, while non-streaming ASR waits for the entire speech utterance; thus, professionals may have to operate in either mode to satisfy their application. In this work, we present joint optimization of streaming and non-streaming ASR based on multi-decoder and knowledge distillation. Primarily, we study 1) the encoder integration of these ASR modules, followed by 2) separate decoders to make mode switching flexible, and 3) performance enhancement by incorporating similarity-preserving knowledge distillation between the two modular encoders and decoders. Evaluation results show 2.6%-5.3% relative character error rate reductions (CERR) on CSJ for streaming ASR, and 8.3%-9.7%…
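
The abstract names similarity-preserving knowledge distillation but does not give its formula in this excerpt. As a rough illustration only, the sketch below follows the commonly used formulation of that loss: build a pairwise similarity matrix over the batch for each module, L2-normalize its rows, and penalize the squared difference between the two matrices. The function names, the feature shapes, and the choice of which module plays teacher or student are assumptions made here, not details confirmed by the paper.

```python
import numpy as np


def similarity_matrix(features: np.ndarray) -> np.ndarray:
    """Row-normalized pairwise similarity matrix for a batch of activations.

    `features` has shape (batch, ...); trailing dimensions are flattened.
    """
    batch = features.shape[0]
    flat = features.reshape(batch, -1)
    gram = flat @ flat.T  # (batch, batch) pairwise dot products
    norms = np.linalg.norm(gram, axis=1, keepdims=True)
    return gram / np.maximum(norms, 1e-12)  # guard against all-zero rows


def sp_kd_loss(teacher_feats: np.ndarray, student_feats: np.ndarray) -> float:
    """Squared Frobenius distance between the two similarity matrices,
    divided by batch**2. Only the batch sizes must match; the feature
    dimensions of the two modules may differ."""
    batch = teacher_feats.shape[0]
    g_teacher = similarity_matrix(teacher_feats)
    g_student = similarity_matrix(student_feats)
    return float(np.sum((g_teacher - g_student) ** 2) / (batch * batch))


# Hypothetical activations: 4 utterances, 10 frames, different hidden sizes.
rng = np.random.default_rng(0)
feats_a = rng.standard_normal((4, 10, 8))   # e.g. one encoder's outputs (assumed)
feats_b = rng.standard_normal((4, 10, 16))  # e.g. the other encoder's outputs (assumed)
loss = sp_kd_loss(feats_a, feats_b)
```

Because the loss compares batch-level similarity structure rather than raw activations, the two modules do not need matching hidden sizes, which is one reason this form of distillation suits pairs of encoders or decoders with different configurations.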


Taxonomy

Methods: Knowledge Distillation