JoVA: Unified Multimodal Learning for Joint Video-Audio Generation
Xiaohu Huang, Hao Zhou, Qiangpeng Yang, Shilei Wen, Kai Han

TL;DR
JoVA is a unified transformer-based framework that enables joint video-audio generation, including lip-synced speech and high-quality video, by employing cross-modal self-attention and a novel mouth-area loss.
Contribution
JoVA introduces a simple, effective approach to joint video-audio generation that achieves direct cross-modal interaction and improved lip-speech synchronization without complex fusion modules.
Findings
Outperforms existing methods in lip-sync accuracy
Achieves high speech quality and video fidelity
Demonstrates effectiveness on benchmark datasets
Abstract
In this paper, we present JoVA, a unified framework for joint video-audio generation. Despite recent encouraging advances, existing methods face two critical limitations. First, most existing approaches can only generate ambient sounds and cannot produce human speech synchronized with lip movements. Second, recent attempts at unified human video-audio generation typically rely on explicit fusion or modality-specific alignment modules, which require additional architecture design and compromise the simplicity of the original transformer. To address these issues, JoVA employs joint self-attention across video and audio tokens within each transformer layer, enabling direct and efficient cross-modal interaction without the need for additional alignment modules. Furthermore, to enable high-quality lip-speech synchronization, we introduce a simple yet effective mouth-area…
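The joint self-attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single attention head, video and audio tokens already projected to a shared width `d`, and randomly initialised projection matrices. The function name `joint_self_attention` and all shapes are hypothetical choices made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_self_attention(video_tokens, audio_tokens, w_q, w_k, w_v):
    """Single-head self-attention over the concatenated video+audio sequence.

    Hypothetical sketch: every token (video or audio) attends to every other
    token, so cross-modal interaction needs no separate fusion module.
    """
    n_video = video_tokens.shape[0]
    # Concatenate along the sequence axis: shape (n_video + n_audio, d).
    tokens = np.concatenate([video_tokens, audio_tokens], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    out = weights @ v
    # Split the output back into the two modalities.
    return out[:n_video], out[n_video:], weights


# Toy inputs (assumed sizes): 6 video tokens, 4 audio tokens, width 8.
rng = np.random.default_rng(0)
d = 8
video = rng.normal(size=(6, d))
audio = rng.normal(size=(4, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

video_out, audio_out, attn = joint_self_attention(video, audio, w_q, w_k, w_v)
print(video_out.shape, audio_out.shape, attn.shape)
```

In this sketch the block `attn[:6, 6:]` holds the weights with which video tokens attend to audio tokens, and `attn[6:, :6]` holds the reverse direction; both come from the same attention operation rather than from a dedicated alignment module.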
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Generative Adversarial Networks and Image Synthesis
