Structured State Space Decoder for Speech Recognition and Synthesis

Koichi Miyazaki; Masato Murata; Tomoki Koriyama

arXiv:2210.17098 · cs.SD · November 1, 2022 · 1 cites

TL;DR

This paper applies the structured state space model (S4) as a decoder for speech recognition and synthesis, reporting error rates competitive with Transformer decoders and greater robustness on long-form speech.

Contribution

The study applies the S4 model as a decoder in ASR and TTS, showing its advantages in long-sequence modeling and robustness compared to Transformers.

Findings

01. Achieves 1.88% WER on LibriSpeech test-clean

02. Outperforms Transformer in long-form speech robustness

03. Outperforms Transformer baseline in TTS tasks

Abstract

Automatic speech recognition (ASR) systems developed in recent years have shown promising results with self-attention models (e.g., Transformer and Conformer), which are replacing conventional recurrent neural networks. Meanwhile, a structured state space model (S4) has been recently proposed, producing promising results for various long-sequence modeling tasks, including raw speech classification. The S4 model can be trained in parallel, same as the Transformer model. In this study, we applied S4 as a decoder for ASR and text-to-speech (TTS) tasks by comparing it with the Transformer decoder. For the ASR task, our experimental results demonstrate that the proposed model achieves a competitive word error rate (WER) of 1.88%/4.25% on LibriSpeech test-clean/test-other set and a character error rate (CER) of 3.80%/2.63%/2.98% on the CSJ eval1/eval2/eval3 set. Furthermore, the proposed…
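
The abstract's remark that S4 can be trained in parallel rests on a general property of linear state space models: the step-by-step recurrence can be rewritten as a causal convolution with a kernel computed once, so no output has to wait for the previous one. The sketch below is a toy illustration of that equivalence only. It is not the S4 parameterization and not the decoder from the paper. The matrices `A`, `B`, `C`, the input `u`, and all function names are made-up values chosen for illustration.

```python
# Toy discrete linear state space model (NOT the S4 parameterization itself):
#   x[k] = A x[k-1] + B u[k],   y[k] = C x[k],   with x = 0 before the first input.
# A, B, C and u below are illustrative values, not taken from the paper.

def matvec(M, v):
    """Multiply a small dense matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def ssm_recurrent(A, B, C, u):
    """Sequential form: step through the input one sample at a time."""
    x = [0.0] * len(B)
    ys = []
    for u_k in u:
        Ax = matvec(A, x)
        x = [a + b * u_k for a, b in zip(Ax, B)]
        ys.append(sum(c * xi for c, xi in zip(C, x)))
    return ys

def ssm_kernel(A, B, C, length):
    """Convolution kernel K[j] = C A^j B for j = 0 .. length-1."""
    kernel = []
    v = list(B)  # holds A^j B at iteration j
    for _ in range(length):
        kernel.append(sum(c * vi for c, vi in zip(C, v)))
        v = matvec(A, v)
    return kernel

def ssm_convolutional(A, B, C, u):
    """Convolutional form: y[k] = sum_j K[j] u[k-j].

    Each output depends only on the kernel and the inputs, not on earlier
    outputs, so all positions could be computed in parallel.
    """
    K = ssm_kernel(A, B, C, len(u))
    return [sum(K[j] * u[k - j] for j in range(k + 1)) for k in range(len(u))]

A = [[0.9, 0.1], [0.0, 0.8]]
B = [1.0, 0.5]
C = [0.3, -0.2]
u = [1.0, 0.0, -1.0, 2.0, 0.5]

y_rec = ssm_recurrent(A, B, C, u)
y_conv = ssm_convolutional(A, B, C, u)
```

Both forms produce the same outputs up to floating-point rounding. The recurrent form suits step-by-step generation, while the convolutional form suits training on whole sequences at once.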

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Adam · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Absolute Position Encodings · Layer Normalization