Speech enhancement deep-learning architecture for efficient edge processing

Monisankha Pal; Arvind Ramanathan; Ted Wada; Ashutosh Pandey

arXiv:2405.16834 · eess.AS · May 28, 2024 · 1 citation

Open Access

TL;DR

This paper introduces an efficient deep learning architecture for speech enhancement designed for low-power edge devices, combining multi-scale features, squeeze-excitation blocks, and a metric GAN to improve speech quality while reducing computational load.
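The metric GAN component can be illustrated with MetricGAN-style least-squares objectives: the discriminator learns to predict a normalized quality score for enhanced speech, and the generator is pushed toward the maximum score. The sketch below assumes that formulation and a PESQ normalization of (PESQ + 0.5) / 5; the function names are hypothetical, and the paper's exact losses may differ.

```python
from statistics import fmean
from typing import Sequence


def normalize_pesq(pesq: float) -> float:
    """Map a raw PESQ score from [-0.5, 4.5] to [0, 1] (assumed normalization)."""
    return (pesq + 0.5) / 5.0


def discriminator_loss(d_clean: Sequence[float], d_enhanced: Sequence[float],
                       q_enhanced: Sequence[float]) -> float:
    """Least-squares MetricGAN-style discriminator loss averaged over a batch.

    d_clean:    D(y, y), outputs for clean/clean pairs (target 1.0)
    d_enhanced: D(G(x), y), outputs for enhanced/clean pairs
    q_enhanced: normalized metric scores Q(G(x), y) the discriminator should predict
    """
    return fmean((dc - 1.0) ** 2 + (de - q) ** 2
                 for dc, de, q in zip(d_clean, d_enhanced, q_enhanced))


def generator_loss(d_enhanced: Sequence[float], target: float = 1.0) -> float:
    """Push the predicted score of enhanced speech toward the best possible value."""
    return fmean((de - target) ** 2 for de in d_enhanced)


if __name__ == "__main__":
    print(generator_loss([0.5, 1.0]))  # 0.125
```

Training the discriminator to imitate a non-differentiable metric such as PESQ is what lets the generator be optimized toward that metric indirectly.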

Contribution

The proposed WSR-MGAN architecture balances speech enhancement performance against computational cost on edge devices by integrating multi-scale (Res2Net) features and squeeze-excitation attention into a WaveUNet-based network.
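As an illustration of the attention mechanism, here is a minimal squeeze-and-excitation sketch over a channels-by-time feature map in plain Python. The weight matrices w1 and w2, the reduction ratio they imply, and the omission of biases are assumptions for illustration, not details taken from the paper.

```python
import math
from typing import List

Matrix = List[List[float]]


def squeeze_excite(x: Matrix, w1: Matrix, w2: Matrix) -> Matrix:
    """Squeeze-and-excitation over a feature map of C channels x T time steps.

    w1: (C // r) x C reduction weights; w2: C x (C // r) expansion weights.
    Biases are omitted for brevity (assumption).
    """
    # Squeeze: global average pooling over time gives one statistic per channel
    z = [sum(ch) / len(ch) for ch in x]
    # Excitation: FC -> ReLU -> FC -> sigmoid yields a gate in (0, 1) per channel
    h = [max(0.0, sum(w * zi for w, zi in zip(row, z))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * hi for w, hi in zip(row, h))))
             for row in w2]
    # Scale: multiply every time step of each channel by that channel's gate
    return [[g * v for v in ch] for g, ch in zip(gates, x)]


if __name__ == "__main__":
    x = [[1.0, 3.0], [2.0, -2.0]]
    # All-zero expansion weights give neutral gates of 0.5 on every channel
    print(squeeze_excite(x, [[1.0, 1.0]], [[0.0], [0.0]]))  # [[0.5, 1.5], [1.0, -1.0]]
```

Because the gates come from time-averaged statistics, the block needs only two small matrix-vector products per call, which is why this form of attention is usually considered much lighter than transformer self-attention.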

Findings

1. Outperforms baseline models on the VoiceBank+DEMAND dataset
2. Achieves state-of-the-art results in time-domain speech enhancement
3. Maintains high speech quality with reduced computational complexity
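The complexity claim can be made concrete with a simple weight count. Assuming the Res2Net split replaces a single full-width convolution, the s - 1 per-group convolutions together hold (s - 1)/s² of the weights of one full-width convolution with the same kernel; this is one plausible contributor to the savings, not a figure reported in the paper. The channel count (64), scale (4), and kernel size (3) below are hypothetical values, not the paper's configuration.

```python
def conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights of a 1-D convolution, biases ignored."""
    return c_in * c_out * kernel


def res2_middle_params(channels: int, scales: int, kernel: int) -> int:
    """Weights of the s - 1 per-group convolutions in a Res2Net split (1x1 convs excluded)."""
    if channels % scales:
        raise ValueError("channels must be divisible by scales")
    width = channels // scales
    return (scales - 1) * conv1d_params(width, width, kernel)


if __name__ == "__main__":
    plain = conv1d_params(64, 64, 3)     # hypothetical full-width layer
    res2 = res2_middle_params(64, 4, 3)  # hypothetical Res2Net split, s = 4
    print(plain, res2)  # 12288 2304
```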

Abstract

Deep learning has become a de facto method of choice for speech enhancement tasks, with significant improvements in speech quality. However, shrinking model size and computation to enable real-time processing on low-power edge devices drastically degrades speech quality. Recently, transformer-based architectures have greatly reduced memory requirements and provided ways to improve model performance through local and global contexts. However, transformer operations remain computationally heavy. In this work, we introduce the WaveUNet squeeze-excitation Res2 (WSR)-based metric generative adversarial network (WSR-MGAN) architecture, which can be efficiently implemented on low-power edge devices for noise suppression tasks while maintaining speech quality. We use Res2Net blocks to extract multi-scale features, which can be related to the spectral content used in speech-processing tasks. In the…
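The multi-scale idea behind Res2Net blocks can be sketched as the hierarchical channel split from the original Res2Net design: channels are divided into s groups, the first group passes through, and each later group is convolved after adding the previous group's output, so later groups see progressively larger receptive fields. This plain-Python version assumes 1-D convolutions over time with one shared kernel per group and omits the surrounding 1x1 convolutions; the function names are hypothetical and this is not the paper's implementation.

```python
from typing import List

Signal = List[float]  # one channel: T time samples


def conv1d_same(sig: Signal, kernel: List[float]) -> Signal:
    """Zero-padded 'same'-length 1-D cross-correlation (the 'convolution' of DL frameworks)."""
    k = len(kernel)
    left = k // 2
    padded = [0.0] * left + list(sig) + [0.0] * (k - 1 - left)
    return [sum(kernel[j] * padded[t + j] for j in range(k)) for t in range(len(sig))]


def res2_split(x: List[Signal], scales: int, kernels: List[List[float]]) -> List[Signal]:
    """Hierarchical split of a Res2Net block; the surrounding 1x1 convolutions are omitted.

    x: C channels, with C divisible by `scales`.
    kernels: scales - 1 kernels, one (hypothetical) shared kernel per group after the first.
    """
    if len(x) % scales:
        raise ValueError("channel count must be divisible by scales")
    if len(kernels) != scales - 1:
        raise ValueError("need one kernel per group after the first")
    width = len(x) // scales
    groups = [x[i * width:(i + 1) * width] for i in range(scales)]

    outputs = [groups[0]]  # y_1 = x_1: the first group passes through unchanged
    prev = None
    for i in range(1, scales):
        if prev is None:
            inp = groups[i]  # y_2 = K_2(x_2)
        else:
            # y_i = K_i(x_i + y_{i-1}): each group also sees the previous group's output
            inp = [[a + b for a, b in zip(c, p)] for c, p in zip(groups[i], prev)]
        prev = [conv1d_same(ch, kernels[i - 1]) for ch in inp]
        outputs.append(prev)

    # Concatenate the groups back along the channel axis
    return [ch for group in outputs for ch in group]


if __name__ == "__main__":
    box = [1.0, 1.0, 1.0]
    impulse = [[0.0] * 5, [0.0, 0.0, 1.0, 0.0, 0.0], [0.0] * 5]
    print(res2_split(impulse, scales=3, kernels=[box, box])[2])  # [1.0, 2.0, 3.0, 2.0, 1.0]
```

In the demo, an impulse in the second group spreads over 3 samples after one convolution and over 5 samples in the third group after two, showing how a single block mixes several temporal scales.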

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis

Methods: 1x1 Convolution · Residual Connection · Res2Net Block · Kaiming Initialization · Average Pooling · Global Average Pooling · Batch Normalization · Convolution · Res2Net