RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech Enhancement

Honglie Chen; Rodrigo Mira; Stavros Petridis; Maja Pantic

arXiv:2407.07825·cs.SD·July 11, 2024

TL;DR

This paper introduces RT-LA-VocE, a real-time audio-visual speech enhancement system that generates clean speech frame by frame from a live video stream and a noisy audio stream, using causal visual and audio encoders, the Emformer, and a causal neural vocoder (C-HiFi-GAN).

Contribution

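The paper presents a fully causal, real-time speech enhancement model that redesigns every component of LA-VocE: new visual and audio encoders that rely solely on past frames, the Emformer in place of the Transformer encoder, and a new causal neural vocoder, C-HiFi-GAN. It achieves state-of-the-art results in all real-time scenarios on AVSpeech.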

Findings

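01 Achieves state-of-the-art results in all real-time scenarios on the AVSpeech dataset.

02 Maintains a low end-to-end processing latency of 28.15ms per frame.

03 Performs causal real-time inference with a 40ms input frame, keeping the algorithm latency at the theoretical minimum (40ms).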
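The two latency figures above can be related with a little arithmetic. The sketch below assumes the usual streaming convention that a system keeps up with its input when the per-frame processing time is shorter than the frame duration. The function names are hypothetical and chosen for illustration; only the 40ms and 28.15ms values come from the paper.

```python
FRAME_MS = 40.0        # input frame duration reported in the abstract
PROCESSING_MS = 28.15  # reported end-to-end processing latency per frame


def real_time_factor(processing_ms: float, frame_ms: float) -> float:
    """Compute time divided by audio duration; below 1.0 means the system keeps up."""
    return processing_ms / frame_ms


def headroom_ms(processing_ms: float, frame_ms: float) -> float:
    """Spare time left in each frame period after processing finishes."""
    return frame_ms - processing_ms


rtf = real_time_factor(PROCESSING_MS, FRAME_MS)
slack = headroom_ms(PROCESSING_MS, FRAME_MS)
frames_per_second = 1000.0 / FRAME_MS

print(f"real-time factor: {rtf:.3f}")
print(f"headroom per frame: {slack:.2f} ms")
print(f"frames per second: {frames_per_second:.0f}")
```

Under this convention, a real-time factor below 1.0 means each 40ms frame is finished before the next one arrives, which is the condition the reported figures satisfy.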

Abstract

In this paper, we aim to generate clean speech frame by frame from a live video stream and a noisy audio stream without relying on future inputs. To this end, we propose RT-LA-VocE, which completely re-designs every component of LA-VocE, a state-of-the-art non-causal audio-visual speech enhancement model, to perform causal real-time inference with a 40ms input frame. We do so by devising new visual and audio encoders that rely solely on past frames, replacing the Transformer encoder with the Emformer, and designing a new causal neural vocoder C-HiFi-GAN. On the popular AVSpeech dataset, we show that our algorithm achieves state-of-the-art results in all real-time scenarios. More importantly, each component is carefully tuned to minimize the algorithm latency to the theoretical minimum (40ms) while maintaining a low end-to-end processing latency of 28.15ms per frame, enabling real-time…
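To illustrate what "relying solely on past frames" means in practice, here is a minimal sketch of a causal filter that produces the same output whether it sees the whole signal at once or one frame at a time. This is not the authors' architecture: the paper's encoders, the Emformer, and C-HiFi-GAN are far more involved. The names `causal_conv_offline`, `StreamingCausalConv`, and `stream_in_frames` are hypothetical.

```python
from typing import List, Sequence


def causal_conv_offline(x: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """y[t] = sum_k kernel[k] * x[t - k], treating samples before t = 0 as zero."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:
                acc += w * x[t - k]
        out.append(acc)
    return out


class StreamingCausalConv:
    """The same filter fed frame by frame, caching only past samples."""

    def __init__(self, kernel: Sequence[float]) -> None:
        self.kernel = list(kernel)
        # State: the last len(kernel) - 1 input samples, initially zeros.
        self.history: List[float] = [0.0] * (len(self.kernel) - 1)

    def process(self, frame: Sequence[float]) -> List[float]:
        buf = self.history + list(frame)
        n_hist = len(self.history)
        out = []
        for i in range(len(frame)):
            t = n_hist + i
            out.append(sum(w * buf[t - k] for k, w in enumerate(self.kernel)))
        if n_hist:
            self.history = buf[-n_hist:]  # carry past context to the next frame
        return out


def stream_in_frames(x: Sequence[float], kernel: Sequence[float], frame_len: int) -> List[float]:
    """Run the streaming filter over x in chunks of frame_len samples."""
    conv = StreamingCausalConv(kernel)
    out: List[float] = []
    for start in range(0, len(x), frame_len):
        out.extend(conv.process(x[start:start + frame_len]))
    return out


signal = [float(i % 5) for i in range(12)]
weights = [0.5, 0.3, 0.2]
offline = causal_conv_offline(signal, weights)
streamed = stream_in_frames(signal, weights, frame_len=4)
```

Because each output sample depends only on the current and earlier inputs, the streamed result matches the offline one, and no future frame ever has to be buffered. A non-causal layer would not have this property.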

Taxonomy

Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · Residual Connection · Byte Pair Encoding · Layer Normalization · Label Smoothing · Adam · Dropout