Leveraging Mamba with Full-Face Vision for Audio-Visual Speech Enhancement

Rong Chao; Wenze Ren; You-Jin Li; Kuo-Hsuan Hung; Sung-Feng Huang; Szu-Wei Fu; Wen-Huang Cheng; and Yu Tsao

arXiv:2508.13624·cs.SD·October 1, 2025

Open Access

TL;DR

This paper introduces AVSEMamba, an audio-visual speech enhancement model that combines full-face visual cues with a Mamba-based temporal backbone to improve target-speech extraction in multi-speaker environments.

Contribution

The paper presents AVSEMamba, which integrates full-face visual information with a Mamba-based temporal backbone, extending Mamba-based speech enhancement from single-speaker to multi-speaker scenarios.

Findings

01

Outperforms other monaural baselines in speech intelligibility (STOI), perceptual quality (PESQ), and non-intrusive quality (UTMOS) on the AVSEC-4 Challenge development and blind test sets.

02

Achieves 1st place on the AVSEC-4 Challenge monaural leaderboard.

03

Uses spatiotemporal visual information to extract target speech more accurately in complex multi-speaker (cocktail party) conditions.

Abstract

Recent Mamba-based models have shown promise in speech enhancement by efficiently modeling long-range temporal dependencies. However, models like Speech Enhancement Mamba (SEMamba) remain limited to single-speaker scenarios and struggle in complex multi-speaker environments such as the cocktail party problem. To overcome this, we introduce AVSEMamba, an audio-visual speech enhancement model that integrates full-face visual cues with a Mamba-based temporal backbone. By leveraging spatiotemporal visual information, AVSEMamba enables more accurate extraction of target speech in challenging conditions. Evaluated on the AVSEC-4 Challenge development and blind test sets, AVSEMamba outperforms other monaural baselines in speech intelligibility (STOI), perceptual quality (PESQ), and non-intrusive quality (UTMOS), and achieves 1st place on the monaural leaderboard.
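
As a rough illustration of the idea in the abstract (not the authors' implementation), the sketch below aligns per-frame visual features to the audio frame rate, concatenates the two streams, and runs a simple diagonal linear state-space recurrence over time. All function names, feature dimensions, frame counts, and the concatenation-based fusion are assumptions made for illustration. The recurrence uses fixed parameters, whereas Mamba makes its state-space parameters input-dependent, so this shows only the general shape of a state-space temporal backbone.

```python
import numpy as np

def align_visual_to_audio(visual, n_audio_frames):
    """Nearest-neighbour upsampling: one visual vector per audio frame (hypothetical helper)."""
    n_video_frames = visual.shape[0]
    idx = (np.arange(n_audio_frames) * n_video_frames) // n_audio_frames
    return visual[idx]

def fuse_audio_visual(audio, visual):
    """Concatenate audio and time-aligned visual features (assumed fusion scheme)."""
    aligned = align_visual_to_audio(visual, audio.shape[0])
    return np.concatenate([audio, aligned], axis=-1)

def diagonal_ssm_scan(x, a, b, c):
    """Per-channel linear recurrence: h_t = a * h_{t-1} + b * x_t, y_t = sum(c * h_t).

    x has shape (T, D); a, b, c have shape (D, N), where N is the state size.
    Parameters are fixed here; a selective model such as Mamba would compute
    them from the input at every step.
    """
    T, D = x.shape
    h = np.zeros_like(a)
    y = np.empty((T, D))
    for t in range(T):
        h = a * h + b * x[t][:, None]
        y[t] = (c * h).sum(axis=-1)
    return y

# Toy shapes (assumed): 100 audio frames of 8 features, 25 video frames of 4 features.
rng = np.random.default_rng(0)
audio = rng.standard_normal((100, 8))
visual = rng.standard_normal((25, 4))

fused = fuse_audio_visual(audio, visual)
D, N = fused.shape[1], 16
a = rng.uniform(0.5, 0.99, size=(D, N))  # decay factors below 1 keep the state stable
b = rng.standard_normal((D, N))
c = rng.standard_normal((D, N))
y = diagonal_ssm_scan(fused, a, b, c)
print(fused.shape, y.shape)  # (100, 12) (100, 12)
```

The recurrence costs time linear in the number of frames, which is the property that makes state-space backbones attractive for long audio sequences.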

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Indoor and Outdoor Localization Technologies