Real-Time Audio-Visual Speech Enhancement Using Pre-trained Visual Representations
T. Aleksandra Ma, Sile Yin, Li-Chia Yang, Shuo Zhang

TL;DR
This paper introduces RAVEN, a real-time audio-visual speech enhancement system that uses pre-trained visual embeddings to improve speech clarity in noisy, multi-speaker environments, and provides the first open-source implementation.
Contribution
The paper presents a real-time audio-visual speech enhancement (AVSE) system built on pre-trained visual embeddings, and evaluates how those embeddings contribute across different SNR conditions and numbers of interfering speakers.
Findings
Concatenating embeddings from audio-visual speech recognition (AVSR) and active speaker detection (ASD) models yields the greatest improvement in low-SNR, multi-speaker environments.
AVSR embeddings alone perform best in noise-only scenarios.
The streaming system runs in real time on a CPU and is presented as the first open-source real-time AVSE implementation.
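To make the first finding concrete, the following is a minimal sketch of frame-wise embedding concatenation. It is an illustration only, not the authors' implementation: the function name `concat_embeddings`, the toy embedding sizes (4 and 2), and the assumption that both embedding streams are already aligned to the same video frame rate are all hypothetical.

```python
from typing import List, Sequence

Frame = List[float]


def concat_embeddings(
    avsr: Sequence[Sequence[float]],
    asd: Sequence[Sequence[float]],
) -> List[Frame]:
    """Concatenate two per-frame embedding sequences along the feature axis.

    Assumes (hypothetically) that both sequences hold one vector per video
    frame and are already time-aligned.
    """
    if len(avsr) != len(asd):
        raise ValueError("embedding streams must have the same number of frames")
    return [list(a) + list(b) for a, b in zip(avsr, asd)]


# Toy data: 3 video frames, made-up feature sizes 4 (AVSR) and 2 (ASD).
avsr_emb = [[0.1 * t] * 4 for t in range(3)]
asd_emb = [[1.0 * t] * 2 for t in range(3)]

fused = concat_embeddings(avsr_emb, asd_emb)
print(len(fused), len(fused[0]))  # 3 6
```

Each fused frame carries both sets of features side by side, so a downstream enhancement network can draw on either one. Real embeddings would be far larger, and their sizes depend on the pre-trained models used.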
Abstract
Speech enhancement in audio-only settings remains challenging, particularly in the presence of interfering speakers. This paper presents a simple yet effective real-time audio-visual speech enhancement (AVSE) system, RAVEN, which isolates and enhances the on-screen target speaker while suppressing interfering speakers and background noise. We investigate how visual embeddings learned from audio-visual speech recognition (AVSR) and active speaker detection (ASD) contribute to AVSE across different SNR conditions and numbers of interfering speakers. Our results show concatenating embeddings from AVSR and ASD models provides the greatest improvement in low-SNR, multi-speaker environments, while AVSR embeddings alone perform best in noise-only scenarios. In addition, we develop a real-time streaming system that operates on a computer CPU and we provide a video demonstration and code…
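The abstract refers to evaluation across different SNR conditions. As background, a noisy mixture at a chosen SNR is commonly built by rescaling the interference so that the ratio of speech power to interference power hits the target. The sketch below shows that standard calculation; it is not taken from the paper, and the helper names `mean_power` and `mix_at_snr` and the toy signals are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def mean_power(x: Sequence[float]) -> float:
    """Average power (mean of squared samples) of a signal."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(
    speech: Sequence[float],
    noise: Sequence[float],
    snr_db: float,
) -> Tuple[List[float], float]:
    """Scale `noise` so that speech-to-noise power ratio equals `snr_db`.

    Returns the mixture and the gain applied to the noise.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_noise == 0:
        raise ValueError("noise signal has zero power")
    # Solve 10 * log10(p_speech / (gain**2 * p_noise)) = snr_db for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    mixture = [s + gain * n for s, n in zip(speech, noise)]
    return mixture, gain


# Toy signals (hypothetical): speech power 1.0, noise power 0.25.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]

mixture, gain = mix_at_snr(speech, noise, snr_db=0.0)
achieved = 10 * math.log10(mean_power(speech) / mean_power([gain * n for n in noise]))
print(round(gain, 6), round(achieved, 6))
```

Lower `snr_db` values produce a larger noise gain, which corresponds to the low-SNR conditions in which the summary reports the concatenated embeddings helping most.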
