CUSIDE: Chunking, Simulating Future Context and Decoding for Streaming ASR

Keyu An, Huahuan Zheng, Zhijian Ou, Hongyu Xiang, Ke Ding, and Guanglu Wan

arXiv:2203.16758 · eess.AS · August 3, 2022

TL;DR

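CUSIDE is a framework for streaming ASR in which a simulation module recursively generates the future context frames instead of waiting for them. Compared with using real future frames as right context, this reduces latency while maintaining accuracy, and it achieves new state-of-the-art streaming results on AISHELL-1.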

Contribution

The paper proposes a new framework with a simulation module for future context in streaming ASR, reducing latency without sacrificing accuracy.

Findings

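1. Compared with using real future frames as right context, simulated future context drastically reduces latency while maintaining recognition accuracy.

2. CUSIDE obtains new state-of-the-art streaming ASR results on the AISHELL-1 dataset.

3. Joint training improves recognition performance.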

Abstract

History and future contextual information are known to be important for accurate acoustic modeling. However, acquiring future context brings latency for streaming ASR. In this paper, we propose a new framework - Chunking, Simulating Future Context and Decoding (CUSIDE) for streaming speech recognition. A new simulation module is introduced to recursively simulate the future contextual frames, without waiting for future context. The simulation module is jointly trained with the ASR model using a self-supervised loss; the ASR model is optimized with the usual ASR loss, e.g., CTC-CRF as used in our experiments. Experiments show that, compared to using real future frames as right context, using simulated future context can drastically reduce latency while maintaining recognition accuracy. With CUSIDE, we obtain new state-of-the-art streaming ASR results on the AISHELL-1 dataset.
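
The chunking and right-context idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the dictionary layout, and the `repeat_last_frame` simulator are hypothetical. In CUSIDE the simulation module is a network trained jointly with the ASR model; the stand-in below simply repeats the last observed frame so that the example stays self-contained. The sketch compares how many real frames must have arrived before a chunk can be decoded when the right context is real versus simulated.

```python
from typing import Callable, Dict, List, Optional, Sequence

Frame = List[float]


def repeat_last_frame(history: Sequence[Frame], n_future: int) -> List[Frame]:
    """Hypothetical stand-in for a learned simulator: repeat the last seen frame."""
    return [list(history[-1]) for _ in range(n_future)]


def build_chunks(
    frames: List[Frame],
    chunk_size: int,
    left: int,
    right: int,
    simulate: Optional[Callable[[Sequence[Frame], int], List[Frame]]] = None,
) -> List[Dict[str, object]]:
    """Split `frames` into chunks with left context and real or simulated right context.

    Each entry holds the model input for one chunk and `frames_needed`, the number
    of real frames that must have arrived before the chunk can be decoded.
    """
    chunks = []
    for start in range(0, len(frames), chunk_size):
        end = min(start + chunk_size, len(frames))
        left_ctx = frames[max(0, start - left):start]
        current = frames[start:end]
        if simulate is None:
            # Real right context: decoding waits for up to `right` future frames.
            right_ctx = frames[end:end + right]
            frames_needed = min(end + right, len(frames))
        else:
            # Simulated right context: only frames up to the chunk end are used.
            right_ctx = simulate(frames[:end], right)
            frames_needed = end
        chunks.append({
            "input": left_ctx + current + right_ctx,
            "frames_needed": frames_needed,
        })
    return chunks


if __name__ == "__main__":
    toy_frames = [[float(i)] for i in range(10)]  # ten 1-dimensional toy frames
    real = build_chunks(toy_frames, chunk_size=4, left=2, right=2)
    sim = build_chunks(toy_frames, chunk_size=4, left=2, right=2,
                       simulate=repeat_last_frame)
    for r, s in zip(real, sim):
        print(r["frames_needed"], s["frames_needed"])
```

With real right context, the first chunk cannot be decoded until the frames after it have arrived; with a simulator, it can be decoded as soon as the chunk itself is complete. That difference is the latency saving the abstract refers to, provided the simulated frames are good enough to preserve accuracy.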

Code & Models

Repositories

thu-spmi/cat
pytorch

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing