Shifted Chunk Encoder for Transformer Based Streaming End-to-End ASR

Fangyuan Wang; Bo Xu

arXiv:2203.15206 · cs.SD · September 27, 2022

TL;DR

This paper introduces a shifted chunk mechanism for Transformer-based streaming ASR, enhancing global context modeling and efficiency while maintaining the advantages of chunk-wise approaches.

Contribution

It proposes a novel shifted chunk mechanism that improves global context modeling in chunk-wise Transformer models for streaming ASR.

Findings

01 Achieves a character error rate (CER) of 6.43% with SChunk-Transformer and 5.77% with SChunk-Conformer on AISHELL-1.

02 Both models have linear computational complexity, enabling efficient training and inference.

03 Outperforms conventional chunk-wise models and is competitive with memory-based methods.
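
The linear-complexity finding can be illustrated with a rough count of attention score entries. The sketch below is not taken from the paper or its repository: the function names and the chunk size of 16 frames are illustrative assumptions, and the count ignores any cross-chunk or history context a real model may add.

```python
# Rough count of query-key score entries in self-attention.
# Names and sizes are illustrative assumptions, not the paper's code.


def full_attention_pairs(num_frames: int) -> int:
    """Every frame attends to every frame: quadratic in num_frames."""
    return num_frames * num_frames


def chunk_attention_pairs(num_frames: int, chunk_size: int) -> int:
    """Each frame attends only to frames inside its own chunk.

    With a fixed chunk_size, the total grows linearly with num_frames.
    """
    full_chunks, remainder = divmod(num_frames, chunk_size)
    # Full chunks contribute chunk_size^2 each; a trailing partial
    # chunk contributes remainder^2.
    return full_chunks * chunk_size ** 2 + remainder ** 2


if __name__ == "__main__":
    for frames in (1024, 2048):
        print(frames, full_attention_pairs(frames),
              chunk_attention_pairs(frames, 16))
```

Under these assumptions, doubling the input from 1024 to 2048 frames quadruples the full-attention count but only doubles the chunk-wise count.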

Abstract

Currently, there are three main kinds of Transformer-encoder-based streaming end-to-end (E2E) automatic speech recognition (ASR) approaches: time-restricted methods, chunk-wise methods, and memory-based methods. Each of them has limitations with respect to linear computational complexity, global context modeling, or parallel training. In this work, we aim to build a streaming Transformer ASR model that has all three of these advantages. In particular, we propose a shifted chunk mechanism for the chunk-wise Transformer that provides cross-chunk connections. As a result, the global context modeling ability of chunk-wise models can be significantly enhanced while all of their original merits are retained. We integrate this scheme with the chunk-wise Transformer and Conformer and call the resulting models SChunk-Transformer and SChunk-Conformer, respectively. Experiments on…
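
The cross-chunk connections described in the abstract can be sketched with attention masks. This is a minimal illustration under stated assumptions, not the authors' implementation: the excerpt does not give the shift size, so a half-chunk shift is assumed here (as in shifted-window attention schemes), and all function names are hypothetical. The sketch shows only the chunk partition, not how the paper handles streaming latency.

```python
# Sketch of regular vs. shifted chunk partitions as boolean attention masks.
# ASSUMPTIONS: half-chunk shift, hypothetical function names; this is not
# the code from the paper's repository.


def chunk_ids(num_frames, chunk_size, shift=0):
    """Assign each frame a chunk index; shift moves the chunk boundaries."""
    return [(t + shift) // chunk_size for t in range(num_frames)]


def chunk_mask(num_frames, chunk_size, shift=0):
    """mask[q][k] is True when query frame q may attend to key frame k."""
    ids = chunk_ids(num_frames, chunk_size, shift)
    return [[ids[q] == ids[k] for k in range(num_frames)]
            for q in range(num_frames)]


def stack(first, second):
    """Input frames that influence each output after two masked layers."""
    n = len(first)
    return [[any(second[q][j] and first[j][k] for j in range(n))
             for k in range(n)]
            for q in range(n)]


def receptive_field(mask, q):
    """List the frames that frame q can draw information from."""
    return [k for k, allowed in enumerate(mask[q]) if allowed]


if __name__ == "__main__":
    regular = chunk_mask(8, 4)           # chunks: frames 0-3, 4-7
    shifted = chunk_mask(8, 4, shift=2)  # chunks: frames 0-1, 2-5, 6-7
    print(receptive_field(regular, 3))
    print(receptive_field(stack(regular, shifted), 3))
```

With 8 frames and a chunk size of 4, frame 3 sees only frames 0 to 3 in a regular chunk layer. After a regular layer followed by a shifted layer, it draws on all 8 frames, because the shifted chunk covering frames 2 to 5 straddles the original chunk boundary. Each layer still attends within fixed-size chunks, so the per-layer cost stays linear in the number of frames.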

Code & Models

Repositories

wangfangyuan/SChunk-Encoder (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing

Methods: Linear Layer · Residual Connection · Softmax · Dropout · Position-Wise Feed-Forward Layer · Dense Connections · Byte Pair Encoding · Attention Is All You Need · Label Smoothing · Multi-Head Attention