SSNAPS: Audio-Visual Separation of Speech and Background Noise with Diffusion Inverse Sampling

Yochai Yemini; Yoav Ellinson; Rami Ben-Ari; Sharon Gannot; Ethan Fetaya

arXiv:2602.01394 · eess.AS · February 3, 2026

Open Access

TL;DR

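SSNAPS is an unsupervised audio-visual method that separates speech from background noise using diffusion inverse sampling. It outperforms leading supervised baselines in word error rate (WER) on noisy mixtures of one to three speakers, and its separated noise component is of high enough fidelity for downstream acoustic scene detection.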

Contribution

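The paper reformulates a recent diffusion inverse sampler so that dedicated diffusion priors for clean speech and for ambient noise are used jointly to recover all underlying sources from a single-microphone audio-visual recording, without any supervised training.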

Findings

01

Outperforms leading supervised baselines in word error rate (WER) on noisy mixtures of 1, 2, and 3 speakers

02

Effectively separates off-screen speakers and ambient noise

03

Produces a high-fidelity separated noise component suitable for downstream acoustic scene detection

Abstract

This paper addresses the challenge of audio-visual single-microphone speech separation and enhancement in the presence of real-world environmental noise. Our approach is based on generative inverse sampling, where we model clean speech and ambient noise with dedicated diffusion priors and jointly leverage them to recover all underlying sources. To achieve this, we reformulate a recent inverse sampler to match our setting. We evaluate on mixtures of 1, 2, and 3 speakers with noise and show that, despite being entirely unsupervised, our method consistently outperforms leading supervised baselines in word error rate (WER) across all conditions. We further extend our framework to handle off-screen speaker separation. Moreover, the high fidelity of the separated noise component makes it suitable for downstream acoustic scene detection. Demo page: https://ssnapsicml.github.io/ssnapsicml2026/
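
For readers unfamiliar with the headline metric, the following is a minimal sketch of the standard word error rate definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. It is not the authors' evaluation code. The function name `word_error_rate` is hypothetical, and the whitespace tokenization with no text normalization is an assumption; the abstract does not state which speech recognizer or normalization is used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance / number of reference words.

    Assumption: plain whitespace tokenization, no case or punctuation
    normalization. Real evaluations usually normalize text first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sat on mat"
    print(f"WER = {word_error_rate(ref, hyp):.3f}")
```

In a separation setting, a score like this would typically be computed on a recognizer's transcript of each separated speech stream against the ground-truth transcript; lower is better, and values above 1.0 are possible when the hypothesis contains many inserted words.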

Taxonomy

Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Hearing Loss and Rehabilitation