UniFlow: Unifying Speech Front-End Tasks via Continuous Generative Modeling

Ziqian Wang; Zikai Liu; Yike Zhu; Xingchen Li; Boyi Kang; Jixun Yao; Xianjun Xia; Chuanzeng Huang; Lei Xie

arXiv:2508.07558 · eess.AS · August 12, 2025

TL;DR

UniFlow introduces a unified generative modeling framework for various speech front-end tasks, leveraging a shared latent space and conditional training to improve performance and extensibility across multiple benchmarks.

Contribution

The paper presents a unified approach based on continuous generative modeling: a waveform VAE and a Diffusion Transformer handle diverse speech front-end tasks in a shared latent space.

Findings

1. Consistent performance improvements over state-of-the-art baselines.

2. Effective task differentiation using learnable condition embeddings.

3. Demonstrated extensibility to new speech processing tasks.
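
The idea of differentiating tasks with learnable condition embeddings can be illustrated with a minimal sketch. This is not the authors' implementation: the latent width, the random initialization, and the choice to add the embedding to every latent frame are all assumptions made for illustration, and the function names are hypothetical. Only the four task names come from the abstract.

```python
import random

# Task names come from the abstract; everything else here is an assumption.
TASKS = ["SE", "TSE", "AEC", "LASS"]
LATENT_DIM = 8  # assumed latent width, not taken from the paper


def init_condition_table(tasks, dim, seed=0):
    """Create one embedding vector per task.

    Values are randomly initialized here; in a real model they would be
    trainable parameters updated jointly with the rest of the network.
    """
    rng = random.Random(seed)
    return {t: [rng.gauss(0.0, 0.02) for _ in range(dim)] for t in tasks}


def condition_latents(latents, task, table):
    """Add the task's embedding to every latent frame.

    Additive injection is an assumed mechanism chosen for simplicity.
    """
    emb = table[task]
    return [[x + e for x, e in zip(frame, emb)] for frame in latents]


table = init_condition_table(TASKS, LATENT_DIM)
latents = [[0.0] * LATENT_DIM for _ in range(4)]  # 4 dummy latent frames

se_input = condition_latents(latents, "SE", table)
aec_input = condition_latents(latents, "AEC", table)
```

Under this sketch, the same latent sequence yields different model inputs depending on the task label, and supporting a new task would amount to adding one more row to the table.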

Abstract

Generative modeling has recently achieved remarkable success across image, video, and audio domains, demonstrating powerful capabilities for unified representation learning. Yet speech front-end tasks such as speech enhancement (SE), target speaker extraction (TSE), acoustic echo cancellation (AEC), and language-queried source separation (LASS) remain largely tackled by disparate, task-specific solutions. This fragmentation leads to redundant engineering effort, inconsistent performance, and limited extensibility. To address this gap, we introduce UniFlow, a unified framework that employs continuous generative modeling to tackle diverse speech front-end tasks in a shared latent space. Specifically, UniFlow utilizes a waveform variational autoencoder (VAE) to learn a compact latent representation of raw audio, coupled with a Diffusion Transformer (DiT) that predicts latent updates. To…

Taxonomy

Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Generative Adversarial Networks and Image Synthesis