Multi-Head State Space Model for Speech Recognition
Yassir Fathullah, Chunyang Wu, Yuan Shangguan, Junteng Jia, Wenhan Xiong, Jay Mahadeokar, Chunxi Liu, Yangyang Shi, Ozlem Kalinli, Mike Seltzer, Mark J. F. Gales

TL;DR
This paper introduces a multi-head state space model with gating mechanisms for speech recognition, in which parallel heads learn local and global temporal dynamics; it outperforms the transformer transducer on LibriSpeech.
Contribution
The paper presents a novel multi-head state space architecture with gating, serving as a drop-in replacement for attention in transformers, and introduces the Stateformer model achieving state-of-the-art results.
Findings
Outperforms transformer transducer on LibriSpeech
Achieves state-of-the-art WER of 1.76%/4.37% on dev and 1.91%/4.36% on test sets
No external language model used
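For readers unfamiliar with the metric, the snippet below is a minimal sketch of how word error rate (WER) is conventionally computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenisation are assumptions made for illustration; the paper's own scoring pipeline is not described in this summary.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenisation is a plain whitespace split, which is
    an assumption and not necessarily what the paper's scoring used.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a WER of 1.76% corresponds to roughly 1.76 word errors (substitutions, deletions, or insertions) per 100 reference words.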
Abstract
State space models (SSMs) have recently shown promising results on small-scale sequence and language modelling tasks, rivalling and outperforming many attention-based approaches. In this paper, we propose a multi-head state space (MH-SSM) architecture equipped with special gating mechanisms, where parallel heads are taught to learn local and global temporal dynamics on sequence data. As a drop-in replacement for multi-head attention in transformer encoders, this new model significantly outperforms the transformer transducer on the LibriSpeech speech recognition corpus. Furthermore, we augment the transformer block with MH-SSM layers, referred to as the Stateformer, achieving state-of-the-art performance on the LibriSpeech task, with word error rates of 1.76%/4.37% on the development and 1.91%/4.36% on the test sets without using an external language model.
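To give some intuition for the idea in the abstract, here is a toy sketch of a gated, multi-head linear state space recurrence. It is not the paper's MH-SSM: the scalar state per head, the choice `b = 1 - a`, the sigmoid input gate, the averaging of heads, and all names (`ssm_scan`, `gated_multi_head_ssm`, `decays`, `gate_weight`) are assumptions made purely for illustration. The point it shows is that heads with a decay near 1 retain information over many steps (global dynamics), while heads with a decay near 0 react mostly to recent inputs (local dynamics).

```python
import math
from typing import List, Sequence


def ssm_scan(u: Sequence[float], a: float, b: float, c: float, d: float = 0.0) -> List[float]:
    """Run a scalar discrete linear state space recurrence over a sequence.

        x_t = a * x_{t-1} + b * u_t
        y_t = c * x_t     + d * u_t

    The state starts at zero. A scalar state is a simplifying assumption.
    """
    x = 0.0
    outputs = []
    for u_t in u:
        x = a * x + b * u_t
        outputs.append(c * x + d * u_t)
    return outputs


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gated_multi_head_ssm(
    u: Sequence[float], decays: Sequence[float], gate_weight: float = 1.0
) -> List[float]:
    """Toy gated multi-head recurrence (hypothetical, not the paper's layer).

    Each head scans the same input with its own decay `a`; the head outputs
    are averaged and multiplied by a sigmoid gate computed from the input.
    """
    heads = [ssm_scan(u, a=a, b=1.0 - a, c=1.0) for a in decays]
    outputs = []
    for t, u_t in enumerate(u):
        gate = sigmoid(gate_weight * u_t)
        mixed = sum(head[t] for head in heads) / len(heads)
        outputs.append(gate * mixed)
    return outputs


# Impulse response of a slow (a=0.9) and a fast (a=0.1) head.
impulse = [1.0, 0.0, 0.0, 0.0]
slow_head = ssm_scan(impulse, a=0.9, b=0.1, c=1.0)
fast_head = ssm_scan(impulse, a=0.1, b=0.9, c=1.0)
combined = gated_multi_head_ssm(impulse, decays=[0.9, 0.1])
```

In this sketch the fast head responds strongly to the impulse and then dies out within a few steps, while the slow head responds weakly but keeps a trace of it for much longer. A real layer would use learned, vector-valued states and projections rather than fixed scalars.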
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
Methods: Test · Softmax · Linear Layer
