Audio-Visual Cross-Modal Compression for Generative Face Video Coding

Youmin Xu; Mengxi Guo; Shijie Zhao; Weiqi Li; Junlin Li; Li Zhang; Jian Zhang

arXiv:2512.15262 · eess.IV · December 18, 2025

TL;DR

This paper introduces Audio-Visual Cross-Modal Compression (AVCC), a framework that jointly compresses the audio and video streams of face videos, exploiting cross-modal coherence to improve rate-distortion performance at low bitrates.

Contribution

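The paper presents a unified audio-visual compression method that exploits the correlation between audio and video through a single audio-video diffusion process, enabling synchronized reconstruction of both modalities from a shared representation as well as generation of one modality from the other.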

Findings

01

AVCC outperforms VVC and existing GFVC methods in rate-distortion metrics.

02

In extremely low-rate scenarios, AVCC can reconstruct one modality from the other.

03

The framework effectively exploits audio-visual coherence for efficient compression.

Abstract

Generative face video coding (GFVC) is vital for modern applications like video conferencing, yet existing methods primarily focus on video motion while neglecting the significant bitrate contribution of audio. Despite the well-established correlation between audio and lip movements, this cross-modal coherence has not been systematically exploited for compression. To address this, we propose an Audio-Visual Cross-Modal Compression (AVCC) framework that jointly compresses audio and video streams. Our framework extracts motion information from video and tokenizes audio features, then aligns them through a unified audio-video diffusion process. This allows synchronized reconstruction of both modalities from a shared representation. In extremely low-rate scenarios, AVCC can even reconstruct one modality from the other. Experiments show that AVCC significantly outperforms the Versatile Video…

Taxonomy

Topics: Speech and Audio Processing · Advanced Data Compression Techniques · Video Coding and Compression Technologies