JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation
Kai Liu, Yanhao Zheng, Kai Wang, Shengqiong Wu, Rongjunchen Zhang, Jiebo Luo, Dimitrios Hatzinakos, Ziwei Liu, Hao Fei, Tat-Seng Chua

TL;DR
JavisDiT++ is a unified framework for joint audio-video generation that improves quality, synchronization, and human preference alignment using a novel mixture-of-experts design, temporal synchronization strategy, and preference optimization.
Contribution
The paper introduces a unified modeling and optimization framework for joint audio-video generation that combines a modality-specific mixture-of-experts (MS-MoE), a temporal-aligned RoPE (TA-RoPE), and preference optimization (AV-DPO), and reports state-of-the-art results.
Findings
Achieves state-of-the-art performance with only 1M training entries.
Significantly outperforms prior approaches in quality and synchronization.
Validated effectiveness through comprehensive ablation studies.
Abstract
AIGC has rapidly expanded from text-to-image generation toward high-quality multimodal synthesis across video and audio. Within this context, joint audio-video generation (JAVG) has emerged as a fundamental task that produces synchronized and semantically aligned sound and vision from textual descriptions. However, compared with advanced commercial models such as Veo3, existing open-source methods still suffer from limitations in generation quality, temporal synchrony, and alignment with human preferences. To bridge the gap, this paper presents JavisDiT++, a concise yet powerful framework for unified modeling and optimization of JAVG. First, we introduce a modality-specific mixture-of-experts (MS-MoE) design that enables cross-modal interaction efficacy while enhancing single-modal generation quality. Then, we propose a temporal-aligned RoPE (TA-RoPE) strategy to achieve explicit,…
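As a rough illustration of the modality-specific expert idea named in the abstract, the sketch below routes each token to a feed-forward network chosen by its modality tag. It is a toy under stated assumptions, not the paper's implementation: the hard routing by tag is inferred from the peer reviews' description of separate FFN layers for audio and video tokens, the shared attention layers are omitted, and `make_ffn`, `ms_moe_ffn`, the dimensions, and the weights are all hypothetical.

```python
import random


def make_ffn(dim, hidden, seed):
    """Build a toy two-layer ReLU FFN with fixed pseudo-random weights."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(dim)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(hidden)]

    def ffn(x):
        # Hidden layer with ReLU, then projection back to the model width.
        h = [max(0.0, sum(x[i] * w1[i][j] for i in range(dim))) for j in range(hidden)]
        return [sum(h[j] * w2[j][k] for j in range(hidden)) for k in range(dim)]

    return ffn


def ms_moe_ffn(tokens, modalities, experts):
    """Send each token to the FFN of its own modality.

    Assumption: routing is a fixed lookup on the modality tag, with no learned gate.
    """
    return [experts[m](x) for x, m in zip(tokens, modalities)]


DIM, HIDDEN = 4, 8  # hypothetical sizes, chosen only for the example
experts = {
    "video": make_ffn(DIM, HIDDEN, seed=0),
    "audio": make_ffn(DIM, HIDDEN, seed=1),
}
tokens = [[1.0, 0.5, -0.5, 2.0]] * 3
modalities = ["video", "audio", "video"]
out = ms_moe_ffn(tokens, modalities, experts)
```

Each token passes through exactly one FFN, so under this reading adding an audio expert increases the parameter count without increasing the per-token compute, which matches the efficiency point raised in the reviews.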
Peer Reviews
Decision: ICLR 2026 Poster
- The joint modeling of audio and video tokens with separate FFN layers is simple and elegant, although these ideas already exist in prior works (e.g., BAGEL) under different contexts.
- The presented model has strong empirical performance, setting a new state of the art for the challenging problem of joint audio-video generation.
- The study of different finetuning strategies (e.g., LoRA with different ranks) is interesting. It provides additional context and insights to the readers.
- The ab
- The proposed network seems to be significantly larger than the networks used in prior works. Can the authors also include the parameter counts and runtime in Table 1 for a more comprehensive comparison?
- The gains in Table 4 from using DPO seem to be very limited, except in FVD. How important is DPO? Are there any user studies or qualitative examples that support the use of DPO?
- There are no details on training data filtering. Section D2 contains a high-level sketch, but it does not menti
- The proposed model design (MS-MoE with LoRA finetuning) is simple yet effective, achieving both high cross-modal alignment and high single-modal generation quality. It is also computationally efficient, as inference cost per token remains constant while the parameter size increases.
- The proposed TA-RoPE requires minimal engineering effort. It provides a natural extension of Wan's RoPE to support both inter- and intra-modal interactions of audio and video.
- AV-DPO sounds novel, as apply
The primary concern lies in the **fairness and clarity of the experimental evaluation**. For Table 1:
- It is unclear which T2A models are used for T2A+A2V and which T2V models are used for T2A+V2A. Also, many results of T2A+A2V and T2V+V2A baselines are missing.
- The authors re-evaluate JavisDiT, but the reported scores deviate substantially from those in the original paper. For instance, TA-IB = 0.197 (original) vs. 0.151 (this paper), CLIP = 0.325 vs. 0.308, AV-IB = 0.201 vs. 0.197, AVHSco
- The proposed method introduces a novel approach for efficiently constructing joint audio-video generation models. While many prior works use a separate audio branch in addition to a video branch, the proposed model only adds dedicated FFNs for audio.
- The proposed positional encoding is carefully designed to leverage pretrained video generation models while inducing strong audio-visual alignment via temporal encoding.
- This work explores direct preference optimization for joint audio-visual
- While the idea of applying DPO to joint generation models is interesting, its effect appears to be relatively small compared to the other proposed modifications. It would be beneficial to investigate the potential causes of this result.
- For example, if the quality of the preference data is a factor, this could be clarified by a subjective evaluation of a small subset of the preference data. If the amount of preference data is insufficient, showing performance as a function of the amount o
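For readers weighing the reviewers' questions about DPO, the sketch below shows the standard direct preference optimization loss for a single preferred/rejected pair. It is the generic objective, not the paper's AV-DPO: how AV-DPO builds audio-video preference pairs or estimates the likelihood terms is not described in this summary, and the function name `dpo_loss` and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Generic DPO loss for one (preferred, rejected) pair.

    logp_w / logp_l: log-likelihoods of the preferred and rejected samples
    under the model being trained; ref_*: the same under a frozen reference.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the trained model and the reference agree, the margin is zero and the loss equals log 2; it falls as the model raises the preferred sample's likelihood relative to the rejected one. With this objective, noisy or too few preference pairs would plausibly limit the gain, which is the hypothesis the reviewers ask the authors to test.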
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies · Speech and Audio Processing
