JoVA: Unified Multimodal Learning for Joint Video-Audio Generation

Xiaohu Huang; Hao Zhou; Qiangpeng Yang; Shilei Wen; Kai Han

arXiv:2512.13677·cs.CV·December 16, 2025

Open Access

TL;DR

JoVA is a unified transformer-based framework for joint video-audio generation. It produces lip-synced speech together with high-quality video by applying joint self-attention across video and audio tokens and adding a novel mouth-area loss.

Contribution

JoVA introduces a simple, effective approach for joint video-audio generation with direct cross-modal interaction and improved lip-speech synchronization without complex fusion modules.

Findings

01. Outperforms existing methods in lip-sync accuracy
02. Achieves high speech quality and video fidelity
03. Demonstrates effectiveness on benchmark datasets

Abstract

In this paper, we present JoVA, a unified framework for joint video-audio generation. Despite recent encouraging advances, existing methods face two critical limitations. First, most existing approaches can only generate ambient sounds and cannot produce human speech synchronized with lip movements. Second, recent attempts at unified human video-audio generation typically rely on explicit fusion or modality-specific alignment modules, which add architectural complexity and sacrifice the simplicity of the original transformer design. To address these issues, JoVA employs joint self-attention across video and audio tokens within each transformer layer, enabling direct and efficient cross-modal interaction without additional alignment modules. Furthermore, to enable high-quality lip-speech synchronization, we introduce a simple yet effective mouth-area…
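As a rough illustration of what joint self-attention across video and audio tokens could look like, the sketch below concatenates the two token sequences and runs ordinary single-head self-attention over the result, so every video token can attend to every audio token and vice versa. This is not the authors' implementation: the function name `joint_self_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the single-head setup, and the token counts and dimensions are all assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_self_attention(video_tokens, audio_tokens, w_q, w_k, w_v):
    """Single-head self-attention over concatenated video and audio tokens.

    Hypothetical sketch: shapes are (num_tokens, dim) with no batch axis.
    """
    n_video = video_tokens.shape[0]
    # One shared sequence: video tokens first, then audio tokens.
    tokens = np.concatenate([video_tokens, audio_tokens], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Scaled dot-product attention over the full joint sequence.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    out = weights @ v
    # Split the output back into the two modalities.
    return out[:n_video], out[n_video:], weights


# Toy inputs (assumed sizes): 6 video tokens, 4 audio tokens, width 16.
rng = np.random.default_rng(0)
d = 16
video = rng.normal(size=(6, d))
audio = rng.normal(size=(4, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

video_out, audio_out, weights = joint_self_attention(video, audio, w_q, w_k, w_v)
```

In this sketch the block `weights[:6, 6:]` holds the video-to-audio attention and `weights[6:, :6]` the audio-to-video attention. Because both come out of the same attention operation, no separate fusion or alignment module appears in the code.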

Taxonomy

Topics: Speech and Audio Processing · Face Recognition and Analysis · Generative Adversarial Networks and Image Synthesis