MOVA: Towards Scalable and Synchronized Video-Audio Generation

SII-OpenMOSS Team: Donghua Yu; Mingshu Chen; Qi Chen; Qi Luo; Qianyi Wu; Qinyuan Cheng; Ruixiao Li; Tianyi Liang; Wenbo Zhang; Wenming Tu; Xiangyu Peng; Yang Gao; Yanru Huo; Ying Zhu; Yinze Luo; Yiyang Zhang; Yuerong Song; Zhe Xu; Zhiyu Zhang; Chenchen Yang; Cheng Chang; Chushu Zhou; Hanfu Chen; Hongnan Ma; Jiaxi Li; Jingqi Tong; Junxi Liu; Ke Chen; Shimin Li; Shiqi Jiang; Songlin Wang; Wei Jiang; Zhaoye Fei; Zhiyuan Ning; Chunguo Li; Chenhui Li; Ziwei He; Zengfeng Huang; Xie Chen; Xipeng Qiu

arXiv:2602.08794 · cs.CV · February 11, 2026

Open Access · 4 Models · 1 Dataset

TL;DR

MOVA is an open-source, scalable model that generates synchronized audio-visual content, including speech, sound effects, and music, using a Mixture-of-Experts architecture to advance research and creative applications.

Contribution

The paper introduces MOVA, an open-source Mixture-of-Experts (MoE) model for synchronized audio-visual generation, released as an open alternative to existing closed-source systems.

Findings

01 Supports high-quality lip-synced speech and synchronized sound effects

02 Employs a 32B-parameter MoE architecture with 18B parameters active during inference

03 Provides comprehensive tools for efficient inference and fine-tuning
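The parameter counts in finding 02 can be turned into a rough back-of-the-envelope estimate. The sketch below is illustrative only: the 32B and 18B figures come from the summary above, while the 2-bytes-per-parameter figure (bf16/fp16 weights) and the helper names `active_fraction` and `weight_gib` are assumptions introduced here, not details from the paper. It covers weight storage only and ignores activations, caches, and framework overhead.

```python
# Rough sizing sketch for a Mixture-of-Experts model.
# Parameter counts are taken from the paper summary; everything else is assumed.

TOTAL_PARAMS = 32_000_000_000   # total parameters (from the summary)
ACTIVE_PARAMS = 18_000_000_000  # parameters active during inference (from the summary)
BYTES_PER_PARAM = 2             # ASSUMPTION: weights stored in bf16/fp16


def active_fraction(total: int, active: int) -> float:
    """Share of the model's parameters used for a single forward pass."""
    if total <= 0 or not 0 <= active <= total:
        raise ValueError("expected 0 <= active <= total and total > 0")
    return active / total


def weight_gib(params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Approximate weight storage in GiB (weights only, no activations)."""
    return params * bytes_per_param / 2**30


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    print(f"active fraction: {frac:.2%}")
    print(f"all weights:     {weight_gib(TOTAL_PARAMS):.1f} GiB")
    print(f"active weights:  {weight_gib(ACTIVE_PARAMS):.1f} GiB")
```

Under these assumptions, per-step compute scales with the active parameters, but the full set of weights still has to be stored somewhere, so total memory is governed by the 32B figure rather than the 18B one.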

Abstract

Audio is indispensable for real-world video, yet generation models have largely overlooked audio components. Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sora 2 emphasize the value of simultaneous generation, joint multimodal modeling introduces unique challenges in architecture, data, and training. Moreover, the closed-source nature of existing systems limits progress in the field. In this work, we introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music. MOVA employs a Mixture-of-Experts (MoE) architecture, with a total of 32B parameters, of which 18B are active during…

Code & Models

Models

Datasets

zhiyuzhang-0212/MOVA_benchmark_for_arena
dataset· 898 dl

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing · Multimodal Machine Learning Applications