WTFormer: A Wavelet Conformer Network for MIMO Speech Enhancement with Spatial Cues Preservation
Lu Han, Junqi Zhao, Renhua Peng

TL;DR
WTFormer is a neural network that combines wavelet transforms with a Conformer architecture to enhance MIMO speech signals while preserving spatial cues, reaching denoising performance comparable to advanced systems with only 0.98 million parameters.
Contribution
The paper introduces WTFormer, a novel MIMO speech enhancement model that integrates wavelet-based multi-resolution analysis and multi-dimensional attention for improved spatial cue preservation.
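To make "wavelet-based multi-resolution analysis" concrete, here is a minimal sketch of a multi-level discrete wavelet decomposition applied along the last axis of a multi-channel signal. The summary does not say which wavelet family or how many levels WTFormer uses, so the Haar wavelet, the two-level depth, and the function names `haar_dwt`, `haar_idwt`, and `multilevel_haar` are all assumptions chosen for illustration, not the authors' implementation.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar_dwt(x):
    """One level of an orthonormal Haar DWT along the last axis.

    Returns (approx, detail), each half the length of the input.
    The Haar wavelet is an illustrative assumption.
    """
    x = np.asarray(x, dtype=float)
    if x.shape[-1] % 2:
        raise ValueError("last axis must have even length")
    even, odd = x[..., 0::2], x[..., 1::2]
    approx = (even + odd) / SQRT2   # low-pass branch (coarse content)
    detail = (even - odd) / SQRT2   # high-pass branch (fine content)
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleave the reconstructed even/odd samples."""
    out = np.empty(approx.shape[:-1] + (2 * approx.shape[-1],))
    out[..., 0::2] = (approx + detail) / SQRT2
    out[..., 1::2] = (approx - detail) / SQRT2
    return out

def multilevel_haar(x, levels):
    """Repeatedly split the approximation band to get several resolutions."""
    details = []
    approx = np.asarray(x, dtype=float)
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        details.append(detail)
    return approx, details

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    signal = rng.standard_normal((4, 16))        # 4 channels, 16 samples
    approx, details = multilevel_haar(signal, levels=2)
    print(approx.shape, [d.shape for d in details])
```

Because the transform is orthonormal, it preserves signal energy and is exactly invertible, so each channel can be split into bands at different resolutions without discarding information.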
Findings
Achieves comparable denoising performance to state-of-the-art systems.
Preserves more spatial information during enhancement.
Operates with only 0.98 million parameters.
Abstract
Current multi-channel speech enhancement systems mainly adopt a single-output architecture, which faces significant challenges in preserving spatio-temporal signal integrity during multiple-input multiple-output (MIMO) processing. To address this limitation, we propose a novel neural network, termed WTFormer, for MIMO speech enhancement that leverages the multi-resolution characteristics of the wavelet transform and multi-dimensional collaborative attention to effectively capture globally distributed spatial features, while using a Conformer for time-frequency modeling. A multi-task loss strategy incorporating the MUSIC algorithm is further proposed for training, to preserve spatial information as far as possible. Experimental results on the LibriSpeech dataset show that WTFormer can achieve comparable denoising performance to advanced systems while preserving more spatial information…
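The abstract names the MUSIC algorithm as part of the training loss without giving details. As background only, the sketch below shows the classical narrowband MUSIC pseudo-spectrum for a uniform linear array. The array geometry, half-wavelength spacing, the single simulated source, and the names `steering_vector`, `music_spectrum`, and `demo` are assumptions made for this example; how the paper actually couples MUSIC to its loss is not stated in this summary.

```python
import numpy as np

def steering_vector(theta_deg, n_mics, spacing_wavelengths=0.5):
    """Far-field steering vector of a uniform linear array (assumed geometry)."""
    theta = np.deg2rad(theta_deg)
    m = np.arange(n_mics)
    return np.exp(-2j * np.pi * spacing_wavelengths * m * np.sin(theta))

def music_spectrum(snapshots, n_sources, angles_deg, spacing_wavelengths=0.5):
    """Classical MUSIC pseudo-spectrum.

    snapshots: complex array of shape (n_mics, n_snapshots).
    Peaks of the returned spectrum indicate likely directions of arrival.
    """
    n_mics, n_snap = snapshots.shape
    cov = snapshots @ snapshots.conj().T / n_snap      # spatial covariance
    _, eigvecs = np.linalg.eigh(cov)                   # eigenvalues ascending
    noise_subspace = eigvecs[:, : n_mics - n_sources]  # smallest eigenvalues
    spectrum = np.empty(len(angles_deg))
    for i, angle in enumerate(angles_deg):
        a = steering_vector(angle, n_mics, spacing_wavelengths)
        proj = noise_subspace.conj().T @ a             # projection onto noise subspace
        spectrum[i] = 1.0 / max(np.real(proj.conj() @ proj), 1e-12)
    return spectrum

def demo(true_angle=20.0, n_mics=8, n_snap=400, seed=0):
    """Simulate one narrowband source in noise and return the estimated angle."""
    rng = np.random.default_rng(seed)
    source = rng.standard_normal(n_snap) + 1j * rng.standard_normal(n_snap)
    noise = 0.1 * (rng.standard_normal((n_mics, n_snap))
                   + 1j * rng.standard_normal((n_mics, n_snap)))
    x = np.outer(steering_vector(true_angle, n_mics), source) + noise
    angles = np.arange(-90.0, 90.5, 0.5)
    return angles[np.argmax(music_spectrum(x, 1, angles))]

if __name__ == "__main__":
    print("estimated direction of arrival (degrees):", demo())
```

One plausible reading is that comparing such pseudo-spectra computed on the enhanced and the clean multi-channel signals penalizes distortion of inter-channel phase relationships, which is what carries the spatial cues.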
Taxonomy
Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Speech Recognition and Synthesis
