Geneses: Unified Generative Speech Enhancement and Separation
Kohei Asai, Wataru Nakata, Yuki Saito, Hiroshi Saruwatari

TL;DR
Geneses is a generative framework that unifies speech enhancement and separation. It targets complex degradations beyond additive noise and is reported to outperform conventional mask-based methods on two-speaker mixtures.
Contribution
The paper presents a generative approach to unified speech enhancement and separation that combines latent flow matching with a multi-modal diffusion Transformer.
Findings
Outperforms conventional mask-based methods on LibriTTS-R mixtures
Demonstrates robustness against complex degradations
Achieves significant improvements in objective metrics
Abstract
Real-world audio recordings often contain multiple speakers and various degradations, which limit both the quantity and quality of speech data available for building state-of-the-art speech processing models. Although end-to-end approaches that concatenate speech enhancement (SE) and speech separation (SS) to obtain a clean speech signal for each speaker are promising, conventional SE-SS methods struggle with complex degradations beyond additive noise. To this end, we propose Geneses, a generative framework to achieve unified, high-quality SE-SS. Geneses leverages latent flow matching to estimate each speaker's clean speech features using a multi-modal diffusion Transformer conditioned on self-supervised learning representations of the noisy mixture. We conduct experimental evaluation using two-speaker mixtures from LibriTTS-R under two conditions: additive-noise-only and complex…
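The latent flow matching idea mentioned in the abstract can be illustrated with a minimal sketch. This is the generic conditional flow matching recipe with a straight-line path between noise and data, which is an assumption here: the abstract does not state the exact path, loss, or sampler that Geneses uses. All function names (`cfm_training_pair`, `cfm_loss`, `euler_sample`) and the tensor layout are hypothetical, and `velocity_fn` is only a placeholder for the velocity network, which in the paper would be the multi-modal diffusion Transformer conditioned on features of the noisy mixture.

```python
import numpy as np


def cfm_training_pair(x1, rng):
    """Build one flow matching training example (illustrative only).

    x1: clean latent features, assumed shape (speakers, frames, dim).
    Returns the sampled time t, the interpolated point x_t, and the
    regression target for the velocity network.
    """
    x0 = rng.standard_normal(x1.shape)  # noise endpoint of the path
    t = rng.uniform()                   # time sampled from [0, 1)
    x_t = (1.0 - t) * x0 + t * x1       # straight-line interpolation (assumed path)
    v_target = x1 - x0                  # velocity of that straight line
    return t, x_t, v_target


def cfm_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - v_target) ** 2))


def euler_sample(velocity_fn, x0, cond, steps=32):
    """Integrate dx/dt = velocity_fn(x, t, cond) from t=0 to t=1 with Euler steps.

    velocity_fn is a stand-in for a trained network; cond is a stand-in
    for the conditioning features extracted from the noisy mixture.
    """
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t, cond)
    return x
```

During training, the network's output at `(x_t, t, cond)` would be regressed onto `v_target` with `cfm_loss`; at inference, `euler_sample` starts from Gaussian noise and follows the learned velocity field to produce one latent per speaker. A real system would also need a decoder to map the latents back to waveforms, which is outside this sketch.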
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis
