Do We Need EMA for Diffusion-Based Speech Enhancement? Toward a Magnitude-Preserving Network Architecture
Julius Richter, Danilo de Oliveira, Timo Gerkmann

TL;DR
This paper investigates whether exponential moving average (EMA) parameter smoothing is necessary in diffusion-based speech enhancement. It proposes a magnitude-preserving architecture, compares two skip-connection configurations, and finds that short or no EMA yields better enhancement performance.
Contribution
The paper introduces a magnitude-preserving network architecture, explores two skip-connection designs (predicting either environmental noise or clean speech), and analyzes the role of EMA in diffusion-based speech enhancement.
Findings
Short or no EMA yields better speech enhancement performance.
Skip-connection variants show complementary strengths.
The proposed architecture achieves competitive results on benchmark datasets.
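To make the EMA finding concrete, the following is a minimal sketch of conventional EMA parameter smoothing with a fixed decay. It is not the paper's procedure: the abstract says EMA profiles are approximated post training, whereas this toy applies the update during a simulated run. The names `ema_update`, `run_ema`, and `trajectory`, and the decay values, are illustrative assumptions.

```python
def ema_update(ema_params, params, decay):
    """One EMA step: blend the running average with the current parameters."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


def run_ema(trajectory, decay):
    """Apply EMA over a sequence of parameter snapshots (hypothetical helper)."""
    ema = list(trajectory[0])
    for params in trajectory[1:]:
        ema = ema_update(ema, params, decay)
    return ema


# Toy "training run": a single scalar weight moving linearly from 0.0 to 1.0.
trajectory = [[step / 100.0] for step in range(101)]

no_ema = run_ema(trajectory, decay=0.0)       # decay 0 reproduces the final weights
short_ema = run_ema(trajectory, decay=0.9)    # short averaging window, small lag
long_ema = run_ema(trajectory, decay=0.999)   # long averaging window, large lag

print(no_ema[0], short_ema[0], long_ema[0])
```

A longer EMA keeps the smoothed weights closer to earlier snapshots. The toy only illustrates that lag; it says nothing about which setting gives better enhancement quality.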
Abstract
We study diffusion-based speech enhancement using a Schrödinger bridge formulation and extend the EDM2 framework to this setting. We employ time-dependent preconditioning of network inputs and outputs to stabilize training and explore two skip-connection configurations that allow the network to predict either environmental noise or clean speech. To control activation and weight magnitudes, we adopt a magnitude-preserving architecture and learn the contribution of the noisy input within each network block for improved conditioning. We further analyze the impact of exponential moving average (EMA) parameter smoothing by approximating different EMA profiles post training, finding that, unlike in image generation, short or absent EMA consistently yields better speech enhancement performance. Experiments on VoiceBank-DEMAND and EARS-WHAM demonstrate competitive signal-to-distortion ratios…
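As an illustration of what "magnitude-preserving" can mean when a block mixes its features with the noisy input, here is a minimal sketch of a magnitude-preserving weighted sum in the style of EDM2. Whether the paper uses exactly this form is an assumption. The blend weight `t` stands in for the learned contribution of the noisy input, and `features`, `noisy_input`, and `variance` are hypothetical names. The sketch assumes both inputs are uncorrelated with unit variance.

```python
import math
import random


def mp_sum(a, b, t):
    """Blend a and b with weight t, rescaled so unit-variance inputs stay unit variance."""
    scale = math.sqrt((1.0 - t) ** 2 + t ** 2)
    return [((1.0 - t) * x + t * y) / scale for x, y in zip(a, b)]


def variance(xs):
    """Population variance of a list of floats."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


rng = random.Random(0)
features = [rng.gauss(0.0, 1.0) for _ in range(20000)]     # block activations (toy)
noisy_input = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # conditioning signal (toy)

t = 0.3  # assumed blend weight; in a network this would be a learned parameter
plain = [(1.0 - t) * x + t * y for x, y in zip(features, noisy_input)]
blended = mp_sum(features, noisy_input, t)

print(variance(plain), variance(blended))
```

The plain weighted sum shrinks the activation variance, while the rescaled version keeps it close to one regardless of `t`. That is the property a magnitude-preserving design aims for.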
Taxonomy
Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Speech Recognition and Synthesis
