Audio-Visual Speech Enhancement: Architectural Design and Deployment Strategies

Anis Hamadouche; Haifeng Luo; Mathini Sellathurai; Amir Hussain; Tharm Ratnarajah

arXiv:2508.08468·cs.SD·April 29, 2026

TL;DR

This paper designs and evaluates a cloud-edge-assisted AVSE system over 5G, highlighting the importance of compute placement, uplink capacity, and compression for real-time performance in multimedia enhancement.

Contribution

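It presents a complete AVSE system that combines CNN-based acoustic enhancement, OpenCV-based facial feature extraction, and an LSTM fusion network, deployed on a 5G edge cloud (AWS Wavelength), together with an end-to-end performance analysis and practical deployment guidelines.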

Findings

01. Edge compute placement is critical for real-time coherence.

02. Uplink capacity often limits performance in interactive AVSE.

03. Compression reduces payload size significantly with minimal perceptual loss.
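
The three findings can be tied together with a simple latency budget. The sketch below is illustrative only: the function names, the additive model (uplink serialization plus round trip plus inference), and every number in it are assumptions for the sake of the example, not values or methods taken from the paper.

```python
# Hypothetical latency-budget sketch; none of these numbers come from the paper.

def uplink_time_ms(payload_bytes: int, uplink_mbps: float) -> float:
    """Time to push one media chunk through the uplink, in milliseconds."""
    if uplink_mbps <= 0:
        raise ValueError("uplink_mbps must be positive")
    bits = payload_bytes * 8
    return bits / (uplink_mbps * 1e6) * 1e3


def end_to_end_ms(payload_bytes: int, uplink_mbps: float,
                  rtt_ms: float, compute_ms: float) -> float:
    """Simplified budget: uplink serialization + network round trip + inference."""
    return uplink_time_ms(payload_bytes, uplink_mbps) + rtt_ms + compute_ms


if __name__ == "__main__":
    # Assumed values, for illustration only.
    raw_chunk = 250_000         # bytes per uncompressed audio-visual chunk
    compressed_chunk = 50_000   # bytes per chunk after compression
    uplink = 10.0               # uplink capacity in Mbit/s
    compute = 30.0              # model inference time per chunk in ms
    for label, rtt in (("edge", 20.0), ("remote cloud", 80.0)):
        for name, size in (("raw", raw_chunk), ("compressed", compressed_chunk)):
            total = end_to_end_ms(size, uplink, rtt, compute)
            print(f"{label:12s} {name:10s} {total:6.1f} ms")
```

Under these assumed values the uplink term dominates the uncompressed case, so shrinking the payload and shortening the round trip (edge placement) both reduce the total, which is the qualitative pattern the findings describe.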

Abstract

Real-time audio-visual speech enhancement (AVSE) is a key enabler for immersive and interactive multimedia services, yet its performance is tightly constrained by network latency, uplink capacity, and computational delay. This paper presents the design, deployment, and evaluation of a complete cloud-edge-assisted AVSE system operating over a public 5G edge network. The system integrates CNN-based acoustic enhancement and OpenCV-based facial feature extraction with an LSTM fusion network to preserve temporal coherence, and is deployed on a Vodafone-compatible AWS Wavelength edge cloud. Through extensive stress testing, we analyze end-to-end performance under varying network load and adaptive multimedia profiles. Results show that compute placement at the network edge is critical for meeting real-time coherence constraints, and that uplink capacity is often the dominant bottleneck for…
