SAV-SE: Scene-aware Audio-Visual Speech Enhancement with Selective State Space Model
Xinyuan Qian, Jiaran Gao, Yaodan Zhang, Qiquan Zhang, Hexin Liu, Leibny Paola Garcia, Haizhou Li

TL;DR
This paper introduces SAV-SE, a novel scene-aware audio-visual speech enhancement method that leverages contextual visual cues from the environment to improve speech clarity, especially in occluded or distant camera scenarios.
Contribution
It pioneers the use of environmental visual context as auxiliary information for speech enhancement, integrating Conformer and Mamba modules for improved performance.
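For readers unfamiliar with the Mamba modules mentioned above, the following is a minimal NumPy sketch of a selective state space recurrence in the general style of Mamba. It is an illustration of the technique only, not the SAV-SE architecture: the function name `selective_scan`, the projection matrices `W_B`, `W_C`, `W_delta`, and all shapes are assumptions made for this example, and the input matrix uses the simplified discretization (step size times B) rather than an exact zero-order hold.

```python
import numpy as np


def selective_scan(x, A, W_B, W_C, W_delta):
    """Sequential selective state space recurrence for one sequence.

    Hypothetical shapes (assumptions for this sketch, not from the paper):
      x:       (T, D) input features over T time steps
      A:       (D, N) state decay rates, expected to be negative
      W_B:     (D, N) projection giving the input-dependent B_t
      W_C:     (D, N) projection giving the input-dependent C_t
      W_delta: (D, D) projection giving the input-dependent step size
    Returns y with shape (T, D).
    """
    T, D = x.shape
    h = np.zeros((D, A.shape[1]))  # hidden state, one N-dim state per channel
    y = np.empty((T, D))
    for t in range(T):
        # "Selective" part: step size, B_t and C_t all depend on the input.
        delta = np.logaddexp(0.0, x[t] @ W_delta)  # softplus keeps it positive
        B_t = x[t] @ W_B  # shape (N,)
        C_t = x[t] @ W_C  # shape (N,)
        # Discretize: decay factor exp(delta * A) lies in (0, 1) when A < 0.
        A_bar = np.exp(delta[:, None] * A)
        B_bar = delta[:, None] * B_t[None, :]
        # Linear recurrence on the state, then read out through C_t.
        h = A_bar * h + B_bar * x[t][:, None]
        y[t] = h @ C_t
    return y
```

Because each output depends only on the current and earlier inputs, the recurrence is causal and its cost grows linearly with the sequence length; practical implementations replace the Python loop with a parallel scan.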
Findings
Outperforms existing methods on MUSIC, AVSpeech, and AudioSet datasets.
Demonstrates the effectiveness of environmental visual cues in noisy scenarios.
Provides publicly available source code and demo.
Abstract
Speech enhancement plays an essential role in various applications, and the integration of visual information has been demonstrated to bring substantial advantages. However, the majority of current research concentrates on facial and lip movements, which can be compromised or entirely inaccessible when occlusions occur or when the camera view is distant. Meanwhile, contextual visual cues from the surrounding environment have been overlooked: for example, when we see a dog bark, our brain has the innate ability to discern and filter out the barking noise. To this end, in this paper, we introduce a novel task, i.e., SAV-SE. To the best of our knowledge, this is the first proposal to use rich contextual information from synchronized video as auxiliary cues to indicate the type of noise, which eventually improves the speech enhancement performance. Specifically, we…
Taxonomy
Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Image and Signal Denoising Methods
Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces
