Multimodal Semantic Communication for Generative Audio-Driven Video Conferencing

Haonan Tong; Haopeng Li; Hongyang Du; Zhaohui Yang; Changchuan Yin; and Dusit Niyato

arXiv:2410.22112·cs.MM·October 30, 2024

TL;DR

This paper introduces Wav2Vid, a multimodal communication system that transmits audio and minimal video data to generate realistic talking head videos, significantly reducing data transmission in video conferencing.

Contribution

The paper presents a novel wave-to-video system that generates talking head videos from audio, reducing transmitted data by up to 83% while maintaining quality.

Findings

01. Data transmission reduced by up to 83%.

02. Generated videos maintain perceptual quality.

03. Efficient synchronization of audio and video data.

Abstract

This paper studies an efficient multimodal data communication scheme for video conferencing. In the considered system, a speaker gives a talk to an audience, and both talking head video and audio are transmitted. Since the speaker does not frequently change posture and high-fidelity transmission of audio (speech and music) is required, much of the video data is redundant and can be removed by generating the video from the audio. To this end, we propose a wave-to-video (Wav2Vid) system, an efficient video transmission framework that reduces transmitted data by generating talking head video from audio. In particular, full-duration audio and short-duration video data are synchronously transmitted through a wireless channel, with neural networks (NNs) extracting and encoding audio and video semantics. The receiver then combines the decoded audio and video data, as well as uses a generative…
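As a rough illustration of why sending full-duration audio with only a short stretch of video cuts the transmitted volume, the sketch below compares raw bit counts for the two cases. The function names, bitrates, and durations are hypothetical and not taken from the paper. The sketch also ignores the semantic encoding, the wireless channel, and the generative model described in the abstract, so it does not reproduce the paper's reported figures.

```python
def transmitted_bits(audio_kbps, video_kbps, talk_seconds, video_seconds):
    """Bits sent when audio covers the whole talk but video covers only part of it.

    All parameters are hypothetical illustration values, not from the paper.
    """
    if not 0 <= video_seconds <= talk_seconds:
        raise ValueError("video_seconds must lie within the talk duration")
    audio_bits = audio_kbps * 1000 * talk_seconds
    video_bits = video_kbps * 1000 * video_seconds
    return audio_bits + video_bits


def reduction_ratio(audio_kbps, video_kbps, talk_seconds, video_seconds):
    """Fraction of data saved relative to sending full-duration audio and video."""
    baseline = transmitted_bits(audio_kbps, video_kbps, talk_seconds, talk_seconds)
    if baseline == 0:
        raise ValueError("baseline transmission is empty")
    reduced = transmitted_bits(audio_kbps, video_kbps, talk_seconds, video_seconds)
    return 1.0 - reduced / baseline


# Hypothetical example: 128 kbps audio, 1000 kbps video,
# a 60 s talk of which only the first 6 s of video are sent.
saving = reduction_ratio(128, 1000, 60, 6)
print(f"data saved: {saving:.1%}")
```

The saving grows as the video bitrate dominates the audio bitrate and as the transmitted video segment gets shorter, which is the intuition behind sending audio in full and regenerating most of the video at the receiver.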

Taxonomy

Topics: Speech and dialogue systems