SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model

Guibin Chen; Dixuan Lin; Jiangping Yang; Youqiang Zhang; Zhengcong Fei; Debang Li; Sheng Chen; Chaofeng Ao; Nuo Pang; Yiming Wang; Yikun Dou; Zheng Chen; Mingyuan Fan; Tuanhui Li; Mingshan Chang; Hao Zhang; Xiaopeng Sun; Jingtao Xu; Yuqiang Xie; Jiahua Wang; Zhiheng Xu; Weiming Xiong; Yuzhe Jin; Baoxuan Gu; Binjie Mao; Yunjie Yu; Jujie He; Yuhao Feng; Shiwen Tu; Chaojie Wang; Rui Yan; Wei Shen; Jingchen Wu; Peng Zhao; Xuanyue Zhong; Zhuangzhuang Liu; Kaifei Wang; Fuxiang Zhang; Weikai Xu; Wenyan Liu; Binglu Zhang; Yu Shen; Tianhui Xiong; Bin Peng; Liang Zeng; Xuchen Song; Haoxiang Guo; Peiyu Wang; Max W. Y. Lam; Chien-Hung Liu; Yahui Zhou

arXiv:2602.21818 · cs.CV · March 19, 2026

Open Access

TL;DR

SkyReels V4 is a pioneering multi-modal video foundation model that enables high-quality, synchronized video and audio generation, inpainting, and editing at cinematic resolutions using a unified architecture and innovative efficiency strategies.

Contribution

The paper introduces SkyReels V4, the first model to jointly support multi-modal input, video-audio generation, and unified inpainting and editing with high efficiency and quality at cinematic scales.

Findings

01. Supports up to 1080p, 32 FPS, 15s videos with high fidelity.

02. Unifies various inpainting tasks under a single interface.

03. Employs a novel efficiency strategy for high-resolution, long-duration generation.

Abstract

SkyReels V4 is a unified multi-modal video foundation model for joint video-audio generation, inpainting, and editing. The model adopts a dual-stream Multimodal Diffusion Transformer (MMDiT) architecture in which one branch synthesizes video and the other generates temporally aligned audio, with both sharing a powerful text encoder based on a Multimodal Large Language Model (MLLM). SkyReels V4 accepts rich multi-modal instructions, including text, images, video clips, masks, and audio references. By combining the MLLM's multi-modal instruction-following capability with in-context learning in the video-branch MMDiT, the model can inject fine-grained visual guidance under complex conditioning, while the audio-branch MMDiT simultaneously leverages audio references to guide sound generation. On the video side, we adopt a channel-concatenation formulation that unifies a wide range of inpainting…
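To make the channel-concatenation idea concrete, here is a minimal NumPy sketch of how such a formulation is commonly set up in latent diffusion models. It is not taken from the paper: the tensor layout `(C, T, H, W)`, the mask convention (1 = known content, 0 = region to generate), and the helper names `build_inpainting_input`, `first_frame_mask`, and `prefix_mask` are all assumptions made for illustration. The point it shows is that different conditioning tasks can differ only in the mask, while the model input keeps the same shape.

```python
import numpy as np


def build_inpainting_input(noisy_latent, cond_latent, mask):
    """Stack noisy latent, masked conditioning latent, and mask along channels.

    Hypothetical shapes: noisy_latent and cond_latent are (C, T, H, W);
    mask is (1, T, H, W) with 1 = keep known content, 0 = generate.
    """
    if noisy_latent.shape != cond_latent.shape:
        raise ValueError("noisy and conditioning latents must share a shape")
    if mask.shape != (1,) + noisy_latent.shape[1:]:
        raise ValueError("mask must have shape (1, T, H, W)")
    # Zero out the conditioning wherever the model is expected to generate.
    masked_cond = cond_latent * mask  # mask broadcasts over the channel axis
    # Resulting channel count is C + C + 1.
    return np.concatenate([noisy_latent, masked_cond, mask], axis=0)


def first_frame_mask(t, h, w):
    """Assumed mask for image-conditioned generation: only frame 0 is known."""
    mask = np.zeros((1, t, h, w), dtype=np.float32)
    mask[:, 0] = 1.0
    return mask


def prefix_mask(t, h, w, known_frames):
    """Assumed mask for continuing a clip: the first `known_frames` are known."""
    mask = np.zeros((1, t, h, w), dtype=np.float32)
    mask[:, :known_frames] = 1.0
    return mask


rng = np.random.default_rng(0)
C, T, H, W = 4, 5, 8, 8  # toy sizes, not the model's real latent dimensions
noisy = rng.standard_normal((C, T, H, W)).astype(np.float32)
cond = rng.standard_normal((C, T, H, W)).astype(np.float32)

x = build_inpainting_input(noisy, cond, first_frame_mask(T, H, W))
y = build_inpainting_input(noisy, cond, prefix_mask(T, H, W, known_frames=3))
print(x.shape, y.shape)
```

Both calls produce an input of the same shape, so a single network interface can serve either task. How SkyReels V4 actually encodes its masks and conditioning is not specified in the text above.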

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing · Advanced Image Processing Techniques