SSCFormer: Push the Limit of Chunk-wise Conformer for Streaming ASR Using Sequentially Sampled Chunks and Chunked Causal Convolution

Fangyuan Wang; Bo Xu; Bo Xu

arXiv:2211.11419·cs.SD·February 6, 2024

TL;DR

SSCFormer advances streaming ASR by introducing a novel context generation method and chunked causal convolution, enabling better global context capture, efficient training, and linear inference complexity.

Contribution

The paper proposes SSCFormer, a new chunk-wise conformer architecture with sequential sampling and chunked causal convolution for improved streaming ASR.

Findings

01 Achieves 5.33% CER on AISHELL-1, outperforming the baseline.

02 Enables training with large batch sizes and linear inference complexity.

03 Effectively captures long-term context in streaming ASR.
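
The gap between quadratic and linear attention cost can be illustrated with a small counting sketch. It assumes the simplest chunk-wise scheme, where the sequence is split into non-overlapping chunks and each frame attends only to frames in its own chunk. That scheme is an illustrative assumption, not SSCFormer's actual attention pattern, and the function names are hypothetical.

```python
def full_attention_pairs(num_frames: int) -> int:
    """Query-key pairs scored when every frame attends to every frame."""
    return num_frames * num_frames


def chunk_attention_pairs(num_frames: int, chunk_size: int) -> int:
    """Query-key pairs scored when attention stays inside each chunk.

    Assumes non-overlapping chunks; a shorter final chunk is allowed.
    """
    full_chunks, remainder = divmod(num_frames, chunk_size)
    return full_chunks * chunk_size * chunk_size + remainder * remainder


if __name__ == "__main__":
    for frames in (1000, 2000, 4000):
        print(frames, full_attention_pairs(frames), chunk_attention_pairs(frames, 10))
```

With a fixed chunk size, doubling the number of frames doubles the chunk-wise count but quadruples the full-attention count.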

Abstract

Currently, chunk-wise schemes are often used to make Automatic Speech Recognition (ASR) models support streaming deployment. However, existing approaches are unable to capture the global context, lack support for parallel training, or exhibit quadratic complexity for the computation of multi-head self-attention (MHSA). On the other hand, causal convolution, which uses no future context, has become the de facto module in streaming Conformer. In this paper, we propose SSCFormer to push the limit of chunk-wise Conformer for streaming ASR using the following two techniques: 1) A novel cross-chunk context generation method, named the Sequential Sampling Chunk (SSC) scheme, which re-partitions regularly partitioned chunks to facilitate efficient long-term contextual interaction within local chunks. 2) The Chunked Causal Convolution (C2Conv) is designed to concurrently capture the left…
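
The abstract's remark that causal convolution uses no future context can be made concrete with a minimal sketch of a standard causal 1-D convolution. This is the generic baseline operation, not the paper's C2Conv, and the function name and the single-channel, list-based layout are assumptions chosen for illustration.

```python
def causal_conv1d(signal, kernel):
    """Causal 1-D convolution over a single channel.

    The input is left-padded with len(kernel) - 1 zeros, so the output at
    time t depends only on inputs at times <= t. The last kernel weight is
    applied to the current frame.
    """
    width = len(kernel)
    padded = [0.0] * (width - 1) + list(signal)
    return [
        sum(weight * padded[t + i] for i, weight in enumerate(kernel))
        for t in range(len(signal))
    ]


if __name__ == "__main__":
    original = causal_conv1d([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0])
    # Changing only the last frame must leave all earlier outputs unchanged.
    perturbed = causal_conv1d([1.0, 2.0, 3.0, 100.0], [1.0, 1.0, 1.0])
    print(original)
    print(perturbed[:3] == original[:3])
```

The perturbation check shows why this operation suits streaming: an output frame can be emitted as soon as its input frame arrives, because no later frame can change it.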

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Convolution