Mask-CTC-based Encoder Pre-training for Streaming End-to-End Speech Recognition

Huaibo Zhao; Yosuke Higuchi; Yusuke Kida; Tetsuji Ogawa; Tetsunori Kobayashi

arXiv:2309.04654·cs.SD·September 12, 2023

Open Access

TL;DR

This paper investigates Mask-CTC-based encoder pre-training to improve accuracy and reduce latency in streaming end-to-end speech recognition across different model architectures, emphasizing look-ahead capabilities and output timing.

Contribution

It demonstrates the effectiveness of Mask-CTC pre-training for various architectures and analyzes its impact on latency and output timing in streaming ASR.

Findings

01  Pre-training improves accuracy and reduces latency.

02  Effective look-ahead capability is achieved across models.

03  Output spike timing becomes more accurate.

Abstract

Achieving high accuracy with low latency has always been a challenge in streaming end-to-end automatic speech recognition (ASR) systems. By attending to more future contexts, a streaming ASR model achieves higher accuracy but incurs larger latency, which hurts the streaming performance. In the Mask-CTC framework, an encoder network is trained to learn the feature representation that anticipates long-term contexts, which is desirable for streaming ASR. Mask-CTC-based encoder pre-training has been shown to be beneficial in achieving low latency and high accuracy for triggered attention-based ASR. However, the effectiveness of this method has not been demonstrated for various model architectures, nor has it been verified that the encoder has the expected look-ahead capability to reduce latency. This study, therefore, examines the effectiveness of Mask-CTC-based pre-training for models with…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing