A Multi-dimensional Deep Structured State Space Approach to Speech Enhancement Using Small-footprint Models
Pin-Jui Ku, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee

TL;DR
This paper introduces a multi-dimensional structured state space (S4) approach for speech enhancement, focusing on small-footprint models that effectively capture spectral dependencies across frequency and time domains, achieving competitive performance with fewer parameters.
Contribution
The paper develops a multi-dimensional S4-based architecture with a whitening transformation, yielding small-footprint speech enhancement models that are much smaller than conventional convolutional models while maintaining competitive enhancement quality.
Findings
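On VoiceBank-DEMAND, the TF-domain S4 model is 78.6% smaller than the conventional convolutional U-net while achieving a competitive PESQ score of 3.15 with data augmentation.
Increasing the model size improves the PESQ score from 3.15 to 3.18.
The 2-D S4 layer acts as a convolutional layer with an infinite receptive field while using fewer parameters than a conventional convolutional layer.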
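The last finding can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes a small, already-discretized, diagonal state space model with hypothetical matrices `A`, `B`, `C`, and it omits the S4 parameterization and the whitening transformation entirely. It shows that a state space recurrence is equivalent to a causal convolution whose kernel `K[k] = C A^k B` can be generated for any length from a fixed number of parameters. The final outer-product step is one common way to build a 2-D kernel from two 1-D kernels and is an assumption here, not a detail confirmed by the abstract.

```python
import numpy as np


def ssm_kernel(A, B, C, length):
    """Return the convolution kernel K[k] = C @ A^k @ B for k = 0..length-1."""
    kernel = np.empty(length)
    state = B.copy()  # holds A^k @ B at step k
    for k in range(length):
        kernel[k] = float(C @ state)
        state = A @ state
    return kernel


def ssm_recurrence(A, B, C, u):
    """Run x[t] = A x[t-1] + B u[t], y[t] = C x[t] one step at a time."""
    x = np.zeros(A.shape[0])
    y = np.empty(len(u))
    for t, u_t in enumerate(u):
        x = A @ x + B * u_t
        y[t] = float(C @ x)
    return y


# Hypothetical toy parameters (assumption): a stable diagonal state matrix.
rng = np.random.default_rng(0)
N = 4  # state size; the layer has 3 * N parameters regardless of input length
A = np.diag(rng.uniform(0.5, 0.95, size=N))
B = rng.standard_normal(N)
C = rng.standard_normal(N)

u = rng.standard_normal(64)  # stand-in for one input sequence
K = ssm_kernel(A, B, C, len(u))  # kernel spans the whole input
y_conv = np.convolve(u, K)[: len(u)]  # causal convolution with that kernel
y_rec = ssm_recurrence(A, B, C, u)
assert np.allclose(y_conv, y_rec)

# Assumed 2-D extension: outer product of a frequency-axis and a time-axis kernel.
K2d = np.outer(ssm_kernel(A, B, C, 16), ssm_kernel(A, B, C, 32))
```

Because the kernel is computed from `A`, `B`, and `C` rather than stored tap by tap, it can cover an input of any length without adding parameters, which is the sense in which the receptive field is unbounded. A conventional convolutional layer would instead need one weight per kernel tap.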
Abstract
We propose a multi-dimensional structured state space (S4) approach to speech enhancement. To better capture the spectral dependencies across the frequency axis, we focus on modifying the multi-dimensional S4 layer with whitening transformation to build new small-footprint models that also achieve good performance. We explore several S4-based deep architectures in time (T) and time-frequency (TF) domains. The 2-D S4 layer can be considered a particular convolutional layer with an infinite receptive field although it utilizes fewer parameters than a conventional convolutional layer. Evaluated on the VoiceBank-DEMAND data set, when compared with the conventional U-net model based on convolutional layers, the proposed TF-domain S4-based model is 78.6% smaller in size, yet it still achieves competitive results with a PESQ score of 3.15 with data augmentation. By increasing the model size,…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Infant Health and Development
Methods: Convolution · Concatenated Skip Connection · Max Pooling · U-Net · Focus
