Augmenting conformers with structured state-space sequence models for online speech recognition

Haozhe Shan; Albert Gu; Zhong Meng; Weiran Wang; Krzysztof Choromanski; Tara Sainath

arXiv:2309.08551·cs.CL·December 29, 2023

Open Access

TL;DR

This paper augments Conformer encoders for online speech recognition with structured state-space sequence (S4) models. The most effective design stacks a small S4 layer with a local convolution, which outperforms Conformers with extensively tuned convolution.

Contribution

It proposes two approaches that combine S4 models with convolutions inside Conformer encoders, backed by systematic ablations over S4 variants, and shows lower word error rates in online ASR.

Findings

01. Best model achieves WERs of 4.01%/8.53% on Librispeech test sets.

02. Stacking a small S4 with a local convolution is the most effective design.

03. Augmentation outperforms Conformers with extensively tuned convolution.
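
For readers unfamiliar with the metric, here is a minimal sketch of how a word error rate (WER) is commonly computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The page does not say how the paper's scoring was done, so this is the standard textbook definition, not the authors' pipeline. The function name `word_error_rate` is hypothetical, and real scoring tools usually normalize text (case, punctuation) first.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # At the start of iteration i, prev[j] is the edit distance
    # between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("a b c d", "a x c d"))  # 0.25
```

On this scale, a WER of 4.01% corresponds to roughly four word errors per hundred reference words.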

Abstract

Online speech recognition, where the model only accesses context to the left, is an important and challenging use case for ASR systems. In this work, we investigate augmenting neural encoders for online ASR by incorporating structured state-space sequence models (S4), a family of models that provide a parameter-efficient way of accessing arbitrarily long left context. We performed systematic ablation studies to compare variants of S4 models and propose two novel approaches that combine them with convolutions. We found that the most effective design is to stack a small S4 using real-valued recurrent weights with a local convolution, allowing them to work complementarily. Our best model achieves WERs of 4.01%/8.53% on test sets from Librispeech, outperforming Conformers with extensively tuned convolution.
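
The stacked design described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation. It assumes a diagonal state-space recurrence with real-valued weights, a single input channel, hand-picked parameters, and an S4-then-convolution order (the abstract does not specify the order). The names `ssm_scan`, `causal_conv`, and `stacked_block` are hypothetical.

```python
from typing import List


def ssm_scan(x: List[float], a: List[float], b: List[float], c: List[float]) -> List[float]:
    """Diagonal linear recurrence with real-valued weights.

    h_t = a * h_{t-1} + b * x_t (elementwise), y_t = sum(c * h_t).
    Keeping |a_i| < 1 makes the state decay, so old inputs fade gradually.
    """
    h = [0.0] * len(a)
    y = []
    for x_t in x:
        h = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, h, b)]
        y.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return y


def causal_conv(x: List[float], kernel: List[float]) -> List[float]:
    """Local convolution that only looks left: y_t = sum_k kernel[k] * x_{t-k}."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:
                acc += w * x[t - k]
        y.append(acc)
    return y


def stacked_block(x: List[float], a: List[float], b: List[float],
                  c: List[float], kernel: List[float]) -> List[float]:
    """Assumed order: recurrence first (long left context), then local convolution."""
    return causal_conv(ssm_scan(x, a, b, c), kernel)


# An impulse at t=0 decays through the one-dimensional state (a=0.5),
# then a two-tap kernel mixes each output with its left neighbour.
print(stacked_block([1.0, 0.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0], kernel=[1.0, 1.0]))
# [1.0, 1.5, 0.75, 0.375]
```

Both pieces read only current and past inputs, so the block is usable online. The recurrence carries a fixed-size state however long the left context grows, while the convolution covers a short local window, which is one way to read the abstract's remark that the two work complementarily.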

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling