Multi-Head State Space Model for Speech Recognition

Yassir Fathullah; Chunyang Wu; Yuan Shangguan; Junteng Jia; Wenhan Xiong; Jay Mahadeokar; Chunxi Liu; Yangyang Shi; Ozlem Kalinli; Mike Seltzer; Mark J. F. Gales

arXiv:2305.12498 · eess.AS · May 29, 2023 · 1 cites

TL;DR

This paper introduces a multi-head state space model with gating mechanisms for speech recognition. Its parallel heads learn local and global temporal dynamics, and it outperforms the transformer transducer on LibriSpeech.

Contribution

The paper presents a multi-head state space (MH-SSM) architecture with gating that serves as a drop-in replacement for multi-head attention in transformer encoders. It also introduces the Stateformer, a transformer block augmented with MH-SSM layers, which achieves state-of-the-art results on LibriSpeech.

Findings

01 Outperforms transformer transducer on LibriSpeech

02 Achieves state-of-the-art WER of 1.76%/4.37% on dev and 1.91%/4.36% on test sets

03 No external language model used
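
For readers unfamiliar with the metric reported above, word error rate (WER) is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The following is a minimal sketch for a single utterance; the function name `word_error_rate` and the whitespace tokenization are assumptions for illustration, not the paper's scoring pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words.

    Illustrative only: tokens are split on whitespace, with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Corpus-level figures are usually computed by summing errors and reference words over all utterances before dividing, rather than averaging per-utterance values.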

Abstract

State space models (SSMs) have recently shown promising results on small-scale sequence and language modelling tasks, rivalling and outperforming many attention-based approaches. In this paper, we propose a multi-head state space (MH-SSM) architecture equipped with special gating mechanisms, where parallel heads are taught to learn local and global temporal dynamics on sequence data. As a drop-in replacement for multi-head attention in transformer encoders, this new model significantly outperforms the transformer transducer on the LibriSpeech speech recognition corpus. Furthermore, we augment the transformer block with MH-SSM layers, referred to as the Stateformer, achieving state-of-the-art performance on the LibriSpeech task, with word error rates of 1.76%/4.37% on the development and 1.91%/4.36% on the test sets without using an external language model.
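
To make the idea of parallel state space heads with gating concrete, here is a rough NumPy sketch of a generic multi-head layer built from diagonal linear recurrences. It is not the paper's MH-SSM: the abstract does not specify the parameterization, so the diagonal recurrence, the sigmoid gate, the projection matrices, and every name here (`init_params`, `mh_ssm_layer`, `W_in`, `W_gate`, `W_out`) are assumptions chosen only for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(d_model, n_heads, state_size, seed=0):
    """Hypothetical parameter set; shapes and ranges are illustrative choices."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(d_model)
    return {
        "W_in": rng.normal(size=(d_model, d_model)) * scale,
        "W_gate": rng.normal(size=(d_model, d_model)) * scale,
        "W_out": rng.normal(size=(d_model, d_model)) * scale,
        # Per-head decay in (0, 1): small values forget quickly (local context),
        # values near 1 retain history longer (global context).
        "a": rng.uniform(0.5, 0.99, size=(n_heads, state_size)),
        "b": rng.normal(size=(n_heads, state_size)),
        "c": rng.normal(size=(n_heads, state_size)),
    }


def mh_ssm_layer(u, params):
    """Apply a gated multi-head diagonal SSM to a sequence u of shape (T, d_model)."""
    T, d_model = u.shape
    n_heads, state_size = params["a"].shape
    d_head = d_model // n_heads

    v = (u @ params["W_in"]).reshape(T, n_heads, d_head)  # split channels into heads
    gate = sigmoid(u @ params["W_gate"])                  # pointwise gate, shape (T, d_model)

    a = params["a"][:, None, :]  # broadcast over the channels inside each head
    b = params["b"][:, None, :]
    c = params["c"][:, None, :]

    x = np.zeros((n_heads, d_head, state_size))  # hidden state per head and channel
    y = np.empty((T, n_heads, d_head))
    for t in range(T):
        x = a * x + b * v[t][:, :, None]  # x_t = a * x_{t-1} + b * u_t
        y[t] = (c * x).sum(axis=-1)       # y_t = <c, x_t>

    return (gate * y.reshape(T, d_model)) @ params["W_out"]


params = init_params(d_model=8, n_heads=2, state_size=4)
frames = np.random.default_rng(1).normal(size=(5, 8))  # 5 time steps, 8 features
out = mh_ssm_layer(frames, params)                     # same shape as the input
```

Because the layer maps a `(T, d_model)` sequence to another `(T, d_model)` sequence, it has the same interface as a self-attention sublayer, which is the sense in which such a module can act as a drop-in replacement. Practical SSM implementations typically evaluate the recurrence as a convolution or a parallel scan instead of a Python loop.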

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques

Methods: Test · Softmax · Linear Layer