SAV-SE: Scene-aware Audio-Visual Speech Enhancement with Selective State Space Model

Xinyuan Qian; Jiaran Gao; Yaodan Zhang; Qiquan Zhang; Hexin Liu; Leibny Paola Garcia; Haizhou Li

arXiv:2411.07751·cs.SD·April 3, 2025

TL;DR

This paper introduces SAV-SE, a novel scene-aware audio-visual speech enhancement method that leverages contextual visual cues from the environment to improve speech clarity, especially in occluded or distant camera scenarios.

Contribution

It pioneers the use of environmental visual context as auxiliary information for speech enhancement, integrating Conformer and Mamba modules for improved performance.

Findings

01 Outperforms existing methods on the MUSIC, AVSpeech, and AudioSet datasets.

02 Demonstrates the effectiveness of environmental visual cues in noisy scenarios.

03 Provides publicly available source code and demo.

Abstract

Speech enhancement plays an essential role in various applications, and the integration of visual information has been demonstrated to bring substantial advantages. However, the majority of current research concentrates on facial and lip movements, which can be compromised or entirely inaccessible when occlusions occur or when the camera view is distant. Meanwhile, contextual visual cues from the surrounding environment have been overlooked: for example, when we see a dog bark, our brain has the innate ability to discern and filter out the barking noise. To this end, we introduce a novel task, scene-aware audio-visual speech enhancement (SAV-SE). To the best of our knowledge, this is the first proposal to use rich contextual information from synchronized video as auxiliary cues to indicate the type of noise, which in turn improves speech enhancement performance. Specifically, we…

Taxonomy

Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Image and Signal Denoising Methods

Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces