Geneses: Unified Generative Speech Enhancement and Separation

Kohei Asai; Wataru Nakata; Yuki Saito; Hiroshi Saruwatari

arXiv:2601.18456·cs.SD·January 27, 2026

TL;DR

Geneses is a generative framework that unifies speech enhancement and speech separation. It handles complex degradations beyond additive noise and outperforms conventional mask-based methods on two-speaker mixtures.

Contribution

The paper presents a generative approach to unified speech enhancement and separation that combines latent flow matching with a multi-modal diffusion Transformer conditioned on self-supervised representations of the noisy mixture.

Findings

01. Outperforms conventional mask-based methods on two-speaker LibriTTS-R mixtures

02. Remains robust under complex degradations beyond additive noise

03. Achieves significant improvements in objective metrics

Abstract

Real-world audio recordings often contain multiple speakers and various degradations, which limit both the quantity and quality of speech data available for building state-of-the-art speech processing models. Although end-to-end approaches that concatenate speech enhancement (SE) and speech separation (SS) to obtain a clean speech signal for each speaker are promising, conventional SE-SS methods struggle with complex degradations beyond additive noise. To this end, we propose Geneses, a generative framework to achieve unified, high-quality SE-SS. Geneses leverages latent flow matching to estimate each speaker's clean speech features using a multi-modal diffusion Transformer conditioned on self-supervised learning representations from the noisy mixture. We conduct experimental evaluation using two-speaker mixtures from LibriTTS-R under two conditions: additive-noise-only and complex…
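The abstract names latent flow matching as the core technique but does not spell out its mechanics here. The following is a minimal NumPy sketch of the generic idea, not the authors' implementation. It assumes a straight-line (linear interpolation) probability path and plain Euler integration, neither of which is confirmed by this listing. The names `fm_training_pair`, `euler_sample`, `velocity_fn`, and `cond` are hypothetical; `cond` stands in for the self-supervised features of the noisy mixture, and `velocity_fn` stands in for the diffusion Transformer.

```python
import numpy as np


def fm_training_pair(x1, rng):
    """Build one flow-matching training example for clean latents x1.

    Assumption: a straight-line path between Gaussian noise x0 and data x1.
    x1 could have shape (speakers, frames, latent_dim), e.g. (2, T, D).
    """
    x0 = rng.standard_normal(x1.shape)   # sample from the noise distribution
    t = rng.uniform()                    # random time in [0, 1)
    xt = (1.0 - t) * x0 + t * x1         # point on the path at time t
    target_velocity = x1 - x0            # regression target for the model
    return xt, t, target_velocity


def euler_sample(velocity_fn, x0, cond, num_steps=32):
    """Integrate dx/dt = velocity_fn(x, t, cond) from t=0 to t=1.

    velocity_fn is a hypothetical stand-in for a trained network that
    predicts the velocity given the current latents, the time, and the
    conditioning features.
    """
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t, cond)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal((2, 5, 4))   # toy latents for two speakers
    xt, t, v = fm_training_pair(clean, rng)
    print(xt.shape, v.shape)

    # Toy oracle velocity that points straight at the known clean latents;
    # a real system would use a trained, conditioned network instead.
    def oracle(x, t, cond):
        return (clean - x) / (1.0 - t)

    estimate = euler_sample(oracle, rng.standard_normal(clean.shape), cond=None)
    print(np.allclose(estimate, clean))
```

In training, a model would be fit so that its output at `(xt, t, cond)` matches `target_velocity` under a mean-squared error; at inference, `euler_sample` starts from noise and follows the learned velocity to produce one latent sequence per speaker, which a separate decoder would turn back into waveforms.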

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis