Deep Sparse Conformer for Speech Recognition

Xianchao Wu

arXiv:2209.00260 · cs.CL · September 2, 2022

TL;DR

This paper introduces a deep sparse Conformer for speech recognition that improves long-sequence representation by combining sparse self-attention with deep normalization, and reports its results on the Japanese CSJ-500h dataset.

Contribution

The paper proposes a deep sparse Conformer architecture that combines sparse self-attention with deep normalization on the residual connections, enabling the training of hundred-level Conformer blocks for speech recognition.
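
As a rough illustration of why sparse self-attention helps on long sequences, the sketch below compares the quadratic cost of dense self-attention with the $O(L \log L)$ cost that the abstract reports for the sparse mechanism. The function names, the base-2 logarithm, and the omission of constant factors are assumptions made for illustration only; none of them comes from the paper or its repository.

```python
import math


def dense_attention_cost(length: int) -> float:
    """Pairwise score count for dense self-attention: L * L (constants ignored)."""
    return float(length * length)


def sparse_attention_cost(length: int) -> float:
    """Illustrative L * log2(L) cost; the choice of log base is an assumption."""
    return length * math.log2(length)


def dense_to_sparse_ratio(length: int) -> float:
    """How many times larger the dense cost is than the sparse cost."""
    return dense_attention_cost(length) / sparse_attention_cost(length)


if __name__ == "__main__":
    # Sequence lengths are arbitrary examples, not values from the paper.
    for length in (256, 1024, 4096):
        print(length, round(dense_to_sparse_ratio(length), 1))
```

Under these assumptions the ratio for L = 1024 is 1024 / log2(1024) = 102.4, and it keeps growing as the sequence gets longer.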

Findings

01. On the Japanese CSJ-500h dataset, the deep sparse Conformer achieves CERs of 5.52%, 4.03%, and 4.50% on the evaluation sets.

02. Ensembling five deep sparse Conformers reduces these CERs to 4.16%, 2.84%, and 3.20%.

03. Deep normalization makes the training of hundred-level Conformer blocks feasible.
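
The gain from ensembling can be made concrete by computing the relative CER reduction from the figures listed above. This is a small arithmetic sketch: the variable and function names are illustrative, and it assumes the two lists of CERs are given in the same evaluation-set order.

```python
# CERs (%) as listed in the findings; pairing by position is an assumption.
single_cers = [5.52, 4.03, 4.50]
ensemble_cers = [4.16, 2.84, 3.20]


def relative_reduction(before: float, after: float) -> float:
    """Relative error reduction in percent: (before - after) / before * 100."""
    return (before - after) / before * 100.0


reductions = [
    round(relative_reduction(single, ensemble), 1)
    for single, ensemble in zip(single_cers, ensemble_cers)
]

if __name__ == "__main__":
    print(reductions)  # [24.6, 29.5, 28.9]
```

Under that pairing assumption, the ensemble lowers the CER by roughly a quarter to just under a third relative to the non-ensembled figures.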

Abstract

Conformer has achieved impressive results in Automatic Speech Recognition (ASR) by combining the Transformer's ability to capture content-based global interactions with the convolutional neural network's ability to exploit local features. In Conformer, two macaron-like feed-forward layers with half-step residual connections sandwich the multi-head self-attention and convolution modules, followed by a post layer normalization. We improve Conformer's long-sequence representation ability in two directions: sparser and deeper. We adapt a sparse self-attention mechanism with $O(L \log L)$ time complexity and memory usage. A deep normalization strategy is applied to the residual connections so that hundred-level Conformer blocks can be trained. On the Japanese CSJ-500h dataset, this deep sparse Conformer achieves respectively CERs of 5.52%, 4.03% and 4.50% on the…
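
The block wiring described in the abstract (two half-step feed-forward residuals around the self-attention and convolution modules, then a post layer normalization) can be sketched as follows. This is a minimal sketch and not the author's implementation: the four sub-modules are hypothetical callables acting on a single feature vector, the layer normalization omits learnable scale and bias, and neither the sparse attention nor the deep normalization strategy is modeled.

```python
import math
from typing import Callable, List

Vector = List[float]
Module = Callable[[Vector], Vector]  # hypothetical stand-in for a sub-module


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize to zero mean and unit variance (no learnable parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual(x: Vector, update: Vector, scale: float = 1.0) -> Vector:
    """Residual connection: x + scale * update."""
    return [a + scale * b for a, b in zip(x, update)]


def conformer_residual_stack(
    x: Vector, ffn1: Module, attention: Module, conv: Module, ffn2: Module
) -> Vector:
    """Apply the four residual branches in the order given in the abstract."""
    x = residual(x, ffn1(x), scale=0.5)   # first macaron feed-forward, half step
    x = residual(x, attention(x))         # multi-head self-attention
    x = residual(x, conv(x))              # convolution module
    x = residual(x, ffn2(x), scale=0.5)   # second macaron feed-forward, half step
    return x


def conformer_block(
    x: Vector, ffn1: Module, attention: Module, conv: Module, ffn2: Module
) -> Vector:
    """Residual stack followed by the post layer normalization."""
    return layer_norm(conformer_residual_stack(x, ffn1, attention, conv, ffn2))
```

Passing the sub-modules as arguments keeps the sketch independent of any particular attention or convolution implementation; in a real model each callable would be a trained layer operating on a sequence of frames.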

Code & Models

Repositories

xianchao-wu/wenet-deep-sparse-conformer
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Convolution