RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech Enhancement
Honglie Chen, Rodrigo Mira, Stavros Petridis, Maja Pantic

TL;DR
This paper introduces RT-LA-VocE, a real-time audio-visual speech enhancement system that generates clean speech frame by frame from a live video stream and a noisy audio stream. It uses new causal encoders, the Emformer, and a causal neural vocoder, and does not rely on future inputs.
Contribution
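The paper presents a fully causal, real-time audio-visual speech enhancement model obtained by redesigning every component of LA-VocE: new visual and audio encoders that rely solely on past frames, the Emformer in place of the Transformer encoder, and a new causal neural vocoder, C-HiFi-GAN. The model achieves state-of-the-art results in all real-time scenarios on AVSpeech.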
Findings
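Achieves state-of-the-art results on the AVSpeech dataset in all real-time scenarios.
Maintains a low end-to-end processing latency of 28.15ms per frame.
Performs causal real-time inference with a 40ms input frame, keeping the algorithm latency at the theoretical minimum (40ms).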
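The two latency figures above can be related with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the reading that "real-time" means the per-frame processing time stays below the frame duration is an assumption, not a definition taken from the paper.

```python
# Figures quoted in the summary above.
FRAME_MS = 40.0        # duration of one input frame (algorithm latency)
PROCESSING_MS = 28.15  # reported end-to-end processing latency per frame


def real_time_factor(processing_ms: float, frame_ms: float) -> float:
    """Ratio of compute time to audio time; below 1.0 means the system keeps up.

    Hypothetical helper; the paper may define its metrics differently.
    """
    return processing_ms / frame_ms


def headroom_ms(processing_ms: float, frame_ms: float) -> float:
    """Spare time per frame before the next frame arrives (assumed definition)."""
    return frame_ms - processing_ms


rtf = real_time_factor(PROCESSING_MS, FRAME_MS)
slack_ms = headroom_ms(PROCESSING_MS, FRAME_MS)
keeps_up = rtf < 1.0

print(f"real-time factor: {rtf:.4f}")
print(f"headroom per frame: {slack_ms:.2f} ms")
print(f"keeps up with the stream: {keeps_up}")
```

Under this reading, each 40ms frame is processed in roughly 70% of the time it takes to arrive, which leaves about 11.85ms of slack per frame.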
Abstract
In this paper, we aim to generate clean speech frame by frame from a live video stream and a noisy audio stream without relying on future inputs. To this end, we propose RT-LA-VocE, which completely re-designs every component of LA-VocE, a state-of-the-art non-causal audio-visual speech enhancement model, to perform causal real-time inference with a 40ms input frame. We do so by devising new visual and audio encoders that rely solely on past frames, replacing the Transformer encoder with the Emformer, and designing a new causal neural vocoder C-HiFi-GAN. On the popular AVSpeech dataset, we show that our algorithm achieves state-of-the-art results in all real-time scenarios. More importantly, each component is carefully tuned to minimize the algorithm latency to the theoretical minimum (40ms) while maintaining a low end-to-end processing latency of 28.15ms per frame, enabling real-time…
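The phrase "without relying on future inputs" can be illustrated with a toy causal filter. The sketch below is not the RT-LA-VocE architecture; it is a minimal example of the general idea, assuming a 1D convolution that is made causal by left-padding and then run frame by frame with a small history buffer. All names (`causal_conv1d`, `StreamingCausalConv`) are hypothetical.

```python
def causal_conv1d(x, kernel):
    """Offline causal convolution: output[t] depends only on x[t-k+1 .. t]."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)  # left-pad so no future sample is needed
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(x))
    ]


class StreamingCausalConv:
    """Same filter applied frame by frame, carrying past samples as state."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        self.history = [0.0] * (len(self.kernel) - 1)  # zeros before the stream starts

    def process(self, frame):
        k = len(self.kernel)
        padded = self.history + list(frame)
        out = [
            sum(self.kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(frame))
        ]
        # Keep only the last k-1 samples as context for the next frame.
        self.history = padded[len(padded) - (k - 1):]
        return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
kernel = [0.5, 0.25, 0.25]

offline = causal_conv1d(signal, kernel)

# Streaming: feed two samples at a time and concatenate the outputs.
stream = StreamingCausalConv(kernel)
streamed = []
for start in range(0, len(signal), 2):
    streamed.extend(stream.process(signal[start:start + 2]))

# Causality check: changing later samples leaves earlier outputs untouched.
perturbed = signal[:4] + [100.0] * 4
early_unchanged = causal_conv1d(perturbed, kernel)[:4] == offline[:4]
```

Because each output depends only on current and past samples, the streaming result matches the offline one, and a model built from such layers can emit output as soon as each frame arrives.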
Taxonomy
Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · Residual Connection · Byte Pair Encoding · Layer Normalization · Label Smoothing · Adam · Dropout
