Audio-Visual Speech Enhancement: Architectural Design and Deployment Strategies
Anis Hamadouche, Haifeng Luo, Mathini Sellathurai, Amir Hussain, Tharm Ratnarajah

TL;DR
This paper designs and evaluates a cloud-edge-assisted audio-visual speech enhancement (AVSE) system over a public 5G edge network, highlighting the importance of compute placement, uplink capacity, and compression for real-time performance.
Contribution
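It presents a complete AVSE system that combines CNN-based acoustic enhancement, OpenCV-based facial feature extraction, and an LSTM fusion network, deployed on an AWS Wavelength 5G edge cloud, together with a stress-test performance analysis and practical deployment guidelines.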
Findings
Edge compute placement is critical for real-time coherence.
Uplink capacity often limits performance in interactive AVSE.
Compression reduces payload size significantly with minimal perceptual loss.
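The three findings above can be illustrated with a back-of-the-envelope latency budget. The sketch below is not the authors' model: the `Link` type, the function names, and every number (link rates, round-trip times, payload sizes, compute time, compression ratio) are hypothetical values chosen only to show how placement, uplink rate, and compression each enter the end-to-end delay.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Hypothetical network path between the client and the compute site."""
    uplink_mbps: float
    downlink_mbps: float
    rtt_ms: float


def transfer_ms(payload_bytes: float, rate_mbps: float) -> float:
    """Time in milliseconds to serialize a payload onto a link of the given rate."""
    return payload_bytes * 8 / (rate_mbps * 1e6) * 1e3


def end_to_end_ms(link: Link, up_bytes: float, down_bytes: float,
                  compute_ms: float, compression_ratio: float = 1.0) -> float:
    """Simplified budget: uplink transfer + round trip + compute + downlink transfer.

    Only the uplink payload (audio plus video frames) is compressed here;
    the enhanced audio returned on the downlink is assumed to be small.
    """
    up = transfer_ms(up_bytes / compression_ratio, link.uplink_mbps)
    down = transfer_ms(down_bytes, link.downlink_mbps)
    return up + link.rtt_ms + compute_ms + down


# Assumed example values, not measurements from the paper.
EDGE = Link(uplink_mbps=10.0, downlink_mbps=100.0, rtt_ms=20.0)
CLOUD = Link(uplink_mbps=10.0, downlink_mbps=100.0, rtt_ms=80.0)
UP_BYTES, DOWN_BYTES, COMPUTE_MS = 50_000, 2_000, 15.0

if __name__ == "__main__":
    for name, link in (("edge", EDGE), ("cloud", CLOUD)):
        raw = end_to_end_ms(link, UP_BYTES, DOWN_BYTES, COMPUTE_MS)
        packed = end_to_end_ms(link, UP_BYTES, DOWN_BYTES, COMPUTE_MS,
                               compression_ratio=4.0)
        print(f"{name}: {raw:.1f} ms uncompressed, {packed:.1f} ms at 4:1")
```

Under these assumed values the uplink transfer is the largest single term, so compressing the uplink payload or moving compute closer to the user changes the total far more than any downlink improvement would.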
Abstract
Real-time audio-visual speech enhancement (AVSE) is a key enabler for immersive and interactive multimedia services, yet its performance is tightly constrained by network latency, uplink capacity, and computational delay. This paper presents the design, deployment, and evaluation of a complete cloud-edge-assisted AVSE system operating over a public 5G edge network. The system integrates CNN-based acoustic enhancement and OpenCV-based facial feature extraction with an LSTM fusion network to preserve temporal coherence, and is deployed on a Vodafone-compatible AWS Wavelength edge cloud. Through extensive stress testing, we analyze end-to-end performance under varying network load and adaptive multimedia profiles. Results show that compute placement at the network edge is critical for meeting real-time coherence constraints, and that uplink capacity is often the dominant bottleneck for…
