Deep Sparse Conformer for Speech Recognition
Xianchao Wu

TL;DR
This paper introduces a deep sparse Conformer model for speech recognition that improves long-sequence processing by combining sparse self-attention with deep normalization, and reports low character error rates (CERs) on the Japanese CSJ-500h dataset.
Contribution
The paper proposes a deep sparse Conformer architecture that combines sparse self-attention with a deep normalization strategy on the residual connections, which makes it possible to train encoders on the order of one hundred Conformer blocks for speech recognition.
Findings
Achieved CERs of 5.52%, 4.03%, and 4.50% on the three CSJ-500h evaluation sets.
Ensembling five deep sparse Conformers reduces the CERs to 4.16%, 2.84%, and 3.20% on the same sets.
Demonstrated effective training of 100-layer Conformer models.
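The CER figures above follow the usual definition: the character-level edit distance between hypothesis and reference, divided by the number of reference characters. The sketch below illustrates that definition only. The function names `edit_distance` and `cer` are hypothetical, and the paper's own scoring pipeline (text normalization, handling of kana and kanji) is not described in this summary.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level character error rate: total edits / total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return errors / total


if __name__ == "__main__":
    # One substituted character out of four reference characters.
    print(cer(["abcd"], ["abxd"]))
```

Pooling errors over the whole corpus before dividing, as done here, weights long utterances more heavily than averaging per-utterance CERs would.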
Abstract
Conformer has achieved impressive results in Automatic Speech Recognition (ASR) by combining the transformer's ability to capture content-based global interactions with the convolutional neural network's ability to exploit local features. In Conformer, two macaron-like feed-forward layers with half-step residual connections sandwich the multi-head self-attention and convolution modules, followed by a post layer normalization. We improve Conformer's long-sequence representation ability in two directions, sparser and deeper. We adapt a sparse self-attention mechanism with O(L log L) time complexity and memory usage. A deep normalization strategy is utilized when performing residual connections to ensure our training of hundred-level Conformer blocks. On the Japanese CSJ-500h dataset, this deep sparse Conformer achieves respectively CERs of 5.52%, 4.03% and 4.50% on the…
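The block ordering described in the abstract can be sketched structurally as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the sub-modules are passed in as plain callables on a list of frames, `layer_norm` has no learned parameters, and `skip_weight` is a hypothetical knob that only marks where a deep-normalization style re-weighting of the skip path could enter (the abstract does not give the exact formula).

```python
import math


def layer_norm(frame, eps=1e-5):
    """Normalize one frame to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(frame) / len(frame)
    var = sum((v - mean) ** 2 for v in frame) / len(frame)
    return [(v - mean) / math.sqrt(var + eps) for v in frame]


def residual(x, branch_out, skip_weight=1.0, branch_scale=1.0):
    """Skip connection per frame: skip_weight * x + branch_scale * branch(x)."""
    return [[skip_weight * a + branch_scale * b for a, b in zip(fx, fb)]
            for fx, fb in zip(x, branch_out)]


def conformer_block(x, ffn1, attention, conv, ffn2, skip_weight=1.0):
    """Macaron ordering: half-step FFN, self-attention, convolution, half-step FFN, post layer norm.

    x is a sequence of frames (list of lists of floats); each sub-module maps
    a sequence to a sequence of the same shape. skip_weight is an assumption
    used for illustration only.
    """
    y = residual(x, ffn1(x), skip_weight, 0.5)       # first half-step feed-forward
    y = residual(y, attention(y), skip_weight)       # (sparse) multi-head self-attention
    y = residual(y, conv(y), skip_weight)            # convolution module
    y = residual(y, ffn2(y), skip_weight, 0.5)       # second half-step feed-forward
    return [layer_norm(frame) for frame in y]        # post layer normalization


if __name__ == "__main__":
    # Stand-in sub-modules that return zeros, so the block reduces to layer_norm(x).
    zeros = lambda seq: [[0.0] * len(f) for f in seq]
    x = [[1.0, 2.0, 3.0], [0.0, 4.0, 8.0]]
    print(conformer_block(x, zeros, zeros, zeros, zeros))
```

Passing the sub-modules as callables keeps the sketch independent of any particular attention variant, so a dense or a sparse attention function can be dropped into the same slot.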
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Convolution
