Lightweight Wasserstein Audio-Visual Model for Unified Speech Enhancement and Separation

Jisoo Park; Seonghak Lee; Guisik Kim; Taewoo Kim; Junseok Kwon

arXiv:2512.06689 · cs.CV · December 9, 2025

TL;DR

UniVoiceLite is a lightweight, unsupervised audio-visual model that unifies speech enhancement and separation, leveraging lip cues and Wasserstein regularization for robust, scalable performance in noisy, multi-speaker environments.

Contribution

UniVoiceLite introduces a unified, unsupervised framework that performs speech enhancement and separation in a single model, guided by audio-visual cues and Wasserstein regularization, with lower model complexity than multi-stage approaches.

Findings

01. Achieves strong performance in noisy and multi-speaker scenarios

02. Operates efficiently with a lightweight model

03. Demonstrates robust generalization without paired data

Abstract

Speech Enhancement (SE) and Speech Separation (SS) have traditionally been treated as distinct tasks in speech processing. However, real-world audio often involves both background noise and overlapping speakers, motivating the need for a unified solution. While recent approaches have attempted to integrate SE and SS within multi-stage architectures, these approaches typically involve complex, parameter-heavy models and rely on supervised training, limiting scalability and generalization. In this work, we propose UniVoiceLite, a lightweight and unsupervised audio-visual framework that unifies SE and SS within a single model. UniVoiceLite leverages lip motion and facial identity cues to guide speech extraction and employs Wasserstein distance regularization to stabilize the latent space without requiring paired noisy-clean data. Experimental results demonstrate that UniVoiceLite achieves…
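The abstract does not say which form of Wasserstein regularization UniVoiceLite uses. As a rough illustration of the general idea only, the sketch below assumes a latent code modeled as a diagonal Gaussian and penalizes its squared 2-Wasserstein distance to a standard normal prior, which has a simple closed form for diagonal Gaussians. The function names, the diagonal-Gaussian assumption, and the choice of prior are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def w2_squared_diag_gaussians(
    mu_a: Sequence[float],
    sigma_a: Sequence[float],
    mu_b: Sequence[float],
    sigma_b: Sequence[float],
) -> float:
    """Squared 2-Wasserstein distance between two diagonal Gaussians.

    For N(mu_a, diag(sigma_a^2)) and N(mu_b, diag(sigma_b^2)) the closed form is
    ||mu_a - mu_b||^2 + ||sigma_a - sigma_b||^2 (sigmas are standard deviations).
    """
    if not (len(mu_a) == len(sigma_a) == len(mu_b) == len(sigma_b)):
        raise ValueError("all inputs must have the same length")
    mean_term = sum((ma - mb) ** 2 for ma, mb in zip(mu_a, mu_b))
    std_term = sum((sa - sb) ** 2 for sa, sb in zip(sigma_a, sigma_b))
    return mean_term + std_term


def latent_regularizer(
    mu: Sequence[float], sigma: Sequence[float], weight: float = 1.0
) -> float:
    """Hypothetical penalty pulling a latent N(mu, diag(sigma^2)) toward N(0, I).

    ASSUMPTION: the standard normal prior and the scalar weight are illustrative
    choices, not details confirmed by the source.
    """
    dim = len(mu)
    prior_mu = [0.0] * dim
    prior_sigma = [1.0] * dim
    return weight * w2_squared_diag_gaussians(mu, sigma, prior_mu, prior_sigma)


if __name__ == "__main__":
    # A latent that already matches the prior incurs no penalty.
    print(latent_regularizer([0.0, 0.0], [1.0, 1.0]))
    # A shifted mean is penalized by its squared norm.
    print(latent_regularizer([3.0, 4.0], [1.0, 1.0]))
```

In a training loop, a term like this would typically be added to a reconstruction loss. Unlike a KL penalty, it stays finite when a standard deviation approaches zero, which is one commonly cited reason to prefer Wasserstein-style regularizers.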

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Advanced Adaptive Filtering Techniques