DiT-Flow: Speech Enhancement Robust to Multiple Distortions based on Flow Matching in Latent Space and Diffusion Transformers
Tianyu Cao, Helin Wang, Ari Frummer, Yuval Sieradzki, Adi Arbel, Laureano Moro Velazquez, Jesus Villalba, Oren Gal, Thomas Thebaud, Najim Dehak

TL;DR
DiT-Flow is a novel speech enhancement framework utilizing flow matching and diffusion transformers in latent space, achieving robustness across diverse distortions and outperforming existing models in realistic datasets.
Contribution
Introduces DiT-Flow, a flow matching-based speech enhancement model using latent diffusion transformers, with parameter-efficient training for robustness to multiple distortions.
Findings
Outperforms state-of-the-art generative SE models.
Robust to multiple unseen distortions.
Uses only 4.9% of parameters for high performance.
Abstract
Recent advances in generative models, such as diffusion and flow matching, have shown strong performance in audio tasks. However, speech enhancement (SE) models are typically trained on limited datasets and evaluated under narrow conditions, limiting real-world applicability. To address this, we propose DiT-Flow, a flow matching-based SE framework built on the latent Diffusion Transformer (DiT) backbone and trained for robustness across diverse distortions, including noise, reverberation, and compression. DiT-Flow operates on compact variational auto-encoders (VAEs)-derived latent features. We validated our approach on StillSonicSet, a synthetic yet acoustically realistic dataset composed of LibriSpeech, FSD50K, FMA, and 90 Matterport3D scenes. Experiments show that DiT-Flow consistently outperforms state-of-the-art generative SE models, demonstrating the effectiveness of flow matching…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech and Audio Processing · Hearing Loss and Rehabilitation · Generative Adversarial Networks and Image Synthesis
