Structured State Space Decoder for Speech Recognition and Synthesis
Koichi Miyazaki, Masato Murata, Tomoki Koriyama

TL;DR
This paper introduces a structured state space model (S4) as a decoder for speech recognition and synthesis, demonstrating competitive performance and robustness over traditional Transformer decoders.
Contribution
The study applies the S4 model as a decoder in ASR and TTS, showing its advantages in long-sequence modeling and robustness compared to Transformers.
Findings
Achieves 1.88% WER on LibriSpeech test-clean
Outperforms Transformer in long-form speech robustness
Outperforms Transformer baseline in TTS tasks
Abstract
Automatic speech recognition (ASR) systems developed in recent years have shown promising results with self-attention models (e.g., Transformer and Conformer), which are replacing conventional recurrent neural networks. Meanwhile, a structured state space model (S4) has been recently proposed, producing promising results for various long-sequence modeling tasks, including raw speech classification. The S4 model can be trained in parallel, same as the Transformer model. In this study, we applied S4 as a decoder for ASR and text-to-speech (TTS) tasks by comparing it with the Transformer decoder. For the ASR task, our experimental results demonstrate that the proposed model achieves a competitive word error rate (WER) of 1.88%/4.25% on LibriSpeech test-clean/test-other set and a character error rate (CER) of 3.80%/2.63%/2.98% on the CSJ eval1/eval2/eval3 set. Furthermore, the proposed…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
MethodsMulti-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Adam · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Absolute Position Encodings · Layer Normalization
