A Multi-dimensional Deep Structured State Space Approach to Speech Enhancement Using Small-footprint Models

Pin-Jui Ku; Chao-Han Huck Yang; Sabato Marco Siniscalchi; Chin-Hui Lee

arXiv:2306.00331·eess.AS·August 28, 2023·1 cites

TL;DR

This paper introduces a multi-dimensional structured state space (S4) approach for speech enhancement, focusing on small-footprint models that effectively capture spectral dependencies across frequency and time domains, achieving competitive performance with fewer parameters.

Contribution

The paper develops a multi-dimensional S4-based architecture with a whitening transformation, yielding small-footprint speech enhancement models that are substantially smaller than conventional convolutional models while maintaining competitive enhancement quality.

Findings

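01 The TF-domain S4-based model is 78.6% smaller than the conventional convolutional U-net while still achieving a competitive PESQ score of 3.15 with data augmentation.

02 Increasing the model size raises the PESQ score from 3.15 to 3.18.

03 The 2-D S4 layer can be viewed as a convolutional layer with an infinite receptive field that uses fewer parameters than a conventional convolutional layer.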

Abstract

We propose a multi-dimensional structured state space (S4) approach to speech enhancement. To better capture the spectral dependencies across the frequency axis, we focus on modifying the multi-dimensional S4 layer with whitening transformation to build new small-footprint models that also achieve good performance. We explore several S4-based deep architectures in time (T) and time-frequency (TF) domains. The 2-D S4 layer can be considered a particular convolutional layer with an infinite receptive field although it utilizes fewer parameters than a conventional convolutional layer. Evaluated on the VoiceBank-DEMAND data set, when compared with the conventional U-net model based on convolutional layers, the proposed TF-domain S4-based model is 78.6% smaller in size, yet it still achieves competitive results with a PESQ score of 3.15 with data augmentation. By increasing the model size,…
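The abstract's remark that a 2-D S4 layer behaves like a convolutional layer with an infinite receptive field can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a diagonal, already-discretized state matrix, builds the 2-D kernel as an outer product of two 1-D state space kernels (one per axis), applies it causally along both axes, and omits the whitening transformation entirely. All function names and parameter values are hypothetical.

```python
# Toy sketch (hypothetical names and values, not the paper's code).
# Assumptions: diagonal discrete-time state matrices; causal convolution
# along both axes; 2-D kernel formed as an outer product of 1-D kernels.

def ssm_kernel(a, b, c, length):
    """Kernel of a diagonal discrete SSM: K[l] = sum_n c[n] * a[n]**l * b[n]."""
    return [sum(cn * (an ** l) * bn for an, bn, cn in zip(a, b, c))
            for l in range(length)]

def outer_kernel_2d(k_freq, k_time):
    """2-D kernel as the outer product of a frequency and a time kernel."""
    return [[kf * kt for kt in k_time] for kf in k_freq]

def causal_conv_2d(x, k):
    """Direct causal 2-D convolution; the kernel spans the whole input."""
    n_f, n_t = len(x), len(x[0])
    y = [[0.0] * n_t for _ in range(n_f)]
    for f in range(n_f):
        for t in range(n_t):
            acc = 0.0
            for i in range(f + 1):
                for j in range(t + 1):
                    acc += k[i][j] * x[f - i][t - j]
            y[f][t] = acc
    return y

# Hypothetical SSM parameters with state size 2 per axis.
a_f, b_f, c_f = [0.9, 0.5], [1.0, 1.0], [0.5, 0.5]
a_t, b_t, c_t = [0.8, 0.3], [1.0, 1.0], [0.7, 0.3]

F, T = 4, 6  # frequency bins and time frames of a toy spectrogram
k_f = ssm_kernel(a_f, b_f, c_f, F)
k_t = ssm_kernel(a_t, b_t, c_t, T)
kernel = outer_kernel_2d(k_f, k_t)

# A unit impulse at the origin reproduces the kernel as the layer output.
impulse = [[0.0] * T for _ in range(F)]
impulse[0][0] = 1.0
response = causal_conv_2d(impulse, kernel)

# The SSM parameter count is fixed; a dense kernel grows with the input.
n_ssm_params = 3 * (len(a_f) + len(a_t))
n_dense_params = F * T
```

The kernel can be generated at any length from the same handful of state space parameters, so its support always covers the full input, whereas a dense convolution kernel of matching size would need one weight per tap. Practical S4 layers typically evaluate this convolution with FFTs rather than the direct loops used here.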

Code & Models

Repositories

kuray107/s4nd-u-net_speech_enhancement (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Convolution · Concatenated Skip Connection · Max Pooling · U-Net · Focus