Real-Time Audio-Visual Speech Enhancement Using Pre-trained Visual Representations

T. Aleksandra Ma; Sile Yin; Li-Chia Yang; Shuo Zhang

arXiv:2507.21448·eess.AS·August 5, 2025

TL;DR

This paper introduces RAVEN, a real-time audio-visual speech enhancement system that uses pre-trained visual embeddings to improve speech clarity in noisy, multi-speaker environments, and provides the first open-source implementation.

Contribution

The paper presents a real-time AVSE system built on pre-trained visual embeddings and evaluates how those embeddings contribute across different SNR conditions and numbers of interfering speakers.

Findings

01

Concatenating AVSR and ASD embeddings yields the best performance in low-SNR, multi-speaker environments.

02

AVSR embeddings alone perform best in noise-only scenarios.

03

The authors present RAVEN as the first open-source implementation of a real-time AVSE system.
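
To make the first finding concrete, the sketch below shows what concatenating two per-frame visual embeddings could look like. It is an illustration only, not the authors' code: the function name `concat_visual_embeddings`, the toy embedding sizes, and the assumption that both embedding streams are already aligned to the same video frames are all hypothetical.

```python
from typing import List, Sequence

Frame = List[float]


def concat_visual_embeddings(
    avsr: Sequence[Sequence[float]],
    asd: Sequence[Sequence[float]],
) -> List[Frame]:
    """Join two per-frame embedding sequences along the feature axis.

    Assumes (hypothetically) that both sequences hold one vector per
    video frame and are already time-aligned.
    """
    if len(avsr) != len(asd):
        raise ValueError("embedding sequences must cover the same number of frames")
    # Each fused frame is the AVSR vector followed by the ASD vector.
    return [list(a) + list(b) for a, b in zip(avsr, asd)]


# Toy data: 3 video frames, 4-dim "AVSR" and 2-dim "ASD" vectors (made-up sizes).
avsr_emb = [[0.1, 0.2, 0.3, 0.4] for _ in range(3)]
asd_emb = [[0.9, 0.8] for _ in range(3)]

fused = concat_visual_embeddings(avsr_emb, asd_emb)
print(len(fused), len(fused[0]))
```

Under this assumption the fused feature width is simply the sum of the two embedding widths, so whatever consumes the visual stream needs an input layer sized accordingly.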

Abstract

Speech enhancement in audio-only settings remains challenging, particularly in the presence of interfering speakers. This paper presents a simple yet effective real-time audio-visual speech enhancement (AVSE) system, RAVEN, which isolates and enhances the on-screen target speaker while suppressing interfering speakers and background noise. We investigate how visual embeddings learned from audio-visual speech recognition (AVSR) and active speaker detection (ASD) contribute to AVSE across different SNR conditions and numbers of interfering speakers. Our results show concatenating embeddings from AVSR and ASD models provides the greatest improvement in low-SNR, multi-speaker environments, while AVSR embeddings alone perform best in noise-only scenarios. In addition, we develop a real-time streaming system that operates on a computer CPU and we provide a video demonstration and code…
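
The abstract refers to evaluation across different SNR conditions. As background, the sketch below shows the standard way to scale an interfering signal so that a mixture has a chosen SNR in dB. It is a generic illustration, not the paper's data pipeline; the helper names `mean_power`, `mix_at_snr`, and `measure_snr_db` and the toy signals are assumptions made here.

```python
import math
from typing import List, Sequence


def mean_power(x: Sequence[float]) -> float:
    """Average power of a signal (mean of squared samples)."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech: Sequence[float], noise: Sequence[float], snr_db: float) -> List[float]:
    """Return speech + gain * noise, with gain chosen to hit the target SNR."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech, p_noise = mean_power(speech), mean_power(noise)
    if p_noise == 0:
        raise ValueError("noise signal has zero power")
    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_noise)), solved for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measure_snr_db(speech: Sequence[float], mixture: Sequence[float]) -> float:
    """SNR of a mixture, treating (mixture - speech) as the interference."""
    residual = [m - s for m, s in zip(mixture, speech)]
    return 10 * math.log10(mean_power(speech) / mean_power(residual))


# Deterministic toy signals standing in for target speech and interference.
speech = [math.sin(0.1 * i) for i in range(200)]
noise = [((i * 37) % 11 - 5) / 5 for i in range(200)]

mixture = mix_at_snr(speech, noise, snr_db=-5.0)
print(round(measure_snr_db(speech, mixture), 6))
```

A negative target such as -5 dB means the interference carries more power than the target speech, which is the kind of low-SNR regime the findings single out.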
