Multimodal Semantic Communication for Generative Audio-Driven Video Conferencing
Haonan Tong, Haopeng Li, Hongyang Du, Zhaohui Yang, Changchuan Yin, and Dusit Niyato

TL;DR
This paper introduces Wav2Vid, a multimodal communication system that transmits audio and minimal video data to generate realistic talking head videos, significantly reducing data transmission in video conferencing.
Contribution
The paper presents a novel wave-to-video system that generates talking head videos from audio, reducing transmitted data by up to 83% while maintaining quality.
Findings
Data transmission reduced by up to 83%.
Generated videos maintain perceptual quality.
Efficient synchronization of audio and video data.
Abstract
This paper studies an efficient multimodal data communication scheme for video conferencing. In our considered system, a speaker gives a talk to an audience, with talking head video and audio being transmitted. Since the speaker does not frequently change posture and high-fidelity transmission of audio (speech and music) is required, redundant visual video data exists and can be removed by generating the video from the audio. To this end, we propose a wave-to-video (Wav2Vid) system, an efficient video transmission framework that reduces transmitted data by generating talking head video from audio. In particular, full-duration audio and short-duration video data are synchronously transmitted through a wireless channel, with neural networks (NNs) extracting and encoding audio and video semantics. The receiver then combines the decoded audio and video data, as well as uses a generative…
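The saving from sending full-duration audio but only short-duration video can be illustrated with a back-of-the-envelope calculation. The sketch below is not the paper's method or evaluation: the function name `data_saved_fraction` and all bitrates and durations are hypothetical values chosen for illustration, and it ignores semantic encoding, channel coding, and any other overhead.

```python
def data_saved_fraction(audio_kbps, video_kbps, talk_s, video_s):
    """Fraction of transmitted bits saved versus sending full audio and full video.

    All arguments are illustrative assumptions, not values from the paper.
    """
    if talk_s <= 0 or audio_kbps + video_kbps <= 0:
        raise ValueError("talk_s and the total bitrate must be positive")
    if not 0 <= video_s <= talk_s:
        raise ValueError("video_s must lie between 0 and talk_s")
    # Baseline: audio and video are both sent for the whole talk.
    baseline_kb = (audio_kbps + video_kbps) * talk_s
    # Reduced scheme: audio for the whole talk, video only for video_s seconds.
    reduced_kb = audio_kbps * talk_s + video_kbps * video_s
    return 1.0 - reduced_kb / baseline_kb


# Hypothetical numbers: 64 kbps audio, 500 kbps video, 60 s talk, 6 s of video sent.
saved = data_saved_fraction(64, 500, 60, 6)
print(f"{saved:.1%}")  # prints 79.8%
```

With these assumed numbers the saving comes almost entirely from the video stream, which is why shortening the transmitted video segment dominates the result.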
Taxonomy
Topics: Speech and dialogue systems
