Wan-S2V: Audio-Driven Cinematic Video Generation

Xin Gao; Li Hu; Siqi Hu; Mingyang Huang; Chaonan Ji; Dechao Meng; Jinwei Qi; Penchong Qiao; Zhen Shen; Yafei Song; Ke Sun; Linrui Tian; Guangyuan Wang; Qi Wang; Zhongjian Wang; Jiayu Xiao; Sheng Xu; Bang Zhang; Peng Zhang; Xindi Zhang; Zhe Zhang; Jingren Zhou; Lian Zhuo

arXiv:2508.18621 · cs.CV · August 27, 2025

TL;DR

Wan-S2V is a novel audio-driven cinematic video generation model that significantly improves expressiveness and fidelity in complex film scenarios, outperforming existing methods like Hunyuan-Avatar and Omnihuman.

Contribution

The paper introduces Wan-S2V, a new model for audio-driven cinematic video generation that enhances realism and expressiveness beyond prior approaches.

Findings

01 Wan-S2V outperforms Hunyuan-Avatar and Omnihuman in benchmarks.

02 The method is versatile for long-form video generation.

03 It achieves high-quality lip-sync editing.

Abstract

Current state-of-the-art (SOTA) methods for audio-driven character animation demonstrate promising performance in scenarios primarily involving speech and singing. However, they often fall short in more complex film and television productions, which demand sophisticated elements such as nuanced character interactions, realistic body movements, and dynamic camera work. To address this long-standing challenge of achieving film-level character animation, we propose an audio-driven model, which we refer to as Wan-S2V, built upon Wan. Our model achieves significantly enhanced expressiveness and fidelity in cinematic contexts compared to existing approaches. We conducted extensive experiments, benchmarking our method against cutting-edge models such as Hunyuan-Avatar and Omnihuman. The experimental results consistently demonstrate that our approach significantly outperforms these existing…
