WTFormer: A Wavelet Conformer Network for MIMO Speech Enhancement with Spatial Cues Preservation

Lu Han; Junqi Zhao; Renhua Peng

arXiv:2506.22001·eess.AS·June 30, 2025

Open Access

TL;DR

WTFormer is a neural network that combines wavelet transforms with a Conformer architecture to enhance MIMO speech signals while preserving spatial cues. It reaches denoising performance comparable to advanced systems with only 0.98 million parameters.

Contribution

The paper introduces WTFormer, a novel MIMO speech enhancement model that integrates wavelet-based multi-resolution analysis and multi-dimensional attention for improved spatial cue preservation.

Findings

01. Achieves denoising performance comparable to state-of-the-art systems.

02. Preserves more spatial information during enhancement.

03. Operates with only 0.98 million parameters.

Abstract

Current multi-channel speech enhancement systems mainly adopt single-output architectures, which face significant challenges in preserving spatio-temporal signal integrity during multiple-input multiple-output (MIMO) processing. To address this limitation, we propose a novel neural network, termed WTFormer, for MIMO speech enhancement. It leverages the multi-resolution characteristics of the wavelet transform and multi-dimensional collaborative attention to capture globally distributed spatial features, and uses a Conformer for time-frequency modeling. A multi-task loss strategy accompanied by the MUSIC algorithm is further proposed for training, so as to preserve as much spatial information as possible. Experimental results on the LibriSpeech dataset show that WTFormer can achieve denoising performance comparable to advanced systems while preserving more spatial information…
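
For readers unfamiliar with MUSIC, the sketch below shows the classic narrowband MUSIC pseudo-spectrum for direction-of-arrival estimation. It is an illustration only, not the paper's multi-task loss. The uniform linear array, half-wavelength spacing, single synthetic source, noise level, and all function names are assumptions made for this example.

```python
import numpy as np

def steering_vector(theta_deg, n_mics, spacing_wavelengths=0.5):
    """Far-field steering vector of a uniform linear array (assumed geometry)."""
    theta = np.deg2rad(theta_deg)
    m = np.arange(n_mics)
    return np.exp(-2j * np.pi * spacing_wavelengths * m * np.sin(theta))

def music_spectrum(snapshots, n_sources, grid_deg, spacing_wavelengths=0.5):
    """Classic MUSIC pseudo-spectrum from (n_mics, n_snapshots) complex data."""
    n_mics, n_snap = snapshots.shape
    # Sample spatial covariance matrix.
    cov = snapshots @ snapshots.conj().T / n_snap
    # eigh returns eigenvalues in ascending order, so the first
    # (n_mics - n_sources) eigenvectors span the noise subspace.
    _, eigvecs = np.linalg.eigh(cov)
    noise_subspace = eigvecs[:, : n_mics - n_sources]
    spectrum = np.empty(len(grid_deg))
    for i, angle in enumerate(grid_deg):
        a = steering_vector(angle, n_mics, spacing_wavelengths)
        proj = noise_subspace.conj().T @ a
        # Peaks appear where the steering vector is orthogonal to the noise subspace.
        spectrum[i] = 1.0 / np.real(proj.conj() @ proj)
    return spectrum

# Synthetic test case (hypothetical values): one source at 20 degrees, 6 microphones.
rng = np.random.default_rng(0)
n_mics, n_snap, true_doa = 6, 400, 20.0
source = rng.standard_normal(n_snap) + 1j * rng.standard_normal(n_snap)
noise = 0.1 * (rng.standard_normal((n_mics, n_snap))
               + 1j * rng.standard_normal((n_mics, n_snap)))
snapshots = np.outer(steering_vector(true_doa, n_mics), source) + noise

grid = np.arange(-90.0, 90.5, 0.5)
estimated_doa = grid[np.argmax(music_spectrum(snapshots, 1, grid))]
print(f"estimated DOA: {estimated_doa:.1f} degrees")
```

A spatial-cue check in this spirit would compare the pseudo-spectrum or its peak location computed from the enhanced multi-channel output against the one computed from the clean reference. How the paper actually integrates MUSIC into training is not specified in the abstract above.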

Taxonomy

Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Speech Recognition and Synthesis