SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model
Guibin Chen, Dixuan Lin, Jiangping Yang, Youqiang Zhang, Zhengcong Fei, Debang Li, Sheng Chen, Chaofeng Ao, Nuo Pang, Yiming Wang, Yikun Dou, Zheng Chen, Mingyuan Fan, Tuanhui Li, Mingshan Chang, Hao Zhang, Xiaopeng Sun, Jingtao Xu, Yuqiang Xie, Jiahua Wang, Zhiheng Xu

TL;DR
SkyReels V4 is a pioneering multi-modal video foundation model that enables high-quality, synchronized video and audio generation, inpainting, and editing at cinematic resolutions using a unified architecture and innovative efficiency strategies.
Contribution
The paper introduces SkyReels V4, the first model to jointly support multi-modal input, video-audio generation, and unified inpainting and editing with high efficiency and quality at cinematic scales.
Findings
Supports up to 1080p, 32 FPS, 15s videos with high fidelity.
Unifies various inpainting tasks under a single interface.
Employs a novel efficiency strategy for high-resolution, long-duration generation.
Abstract
SkyReels V4 is a unified multi-modal video foundation model for joint video-audio generation, inpainting, and editing. The model adopts a dual-stream Multimodal Diffusion Transformer (MMDiT) architecture, where one branch synthesizes video and the other generates temporally aligned audio, while both share a powerful text encoder based on a Multimodal Large Language Model (MLLM). SkyReels V4 accepts rich multi-modal instructions, including text, images, video clips, masks, and audio references. By combining the MLLM's multi-modal instruction-following capability with in-context learning in the video-branch MMDiT, the model can inject fine-grained visual guidance under complex conditioning, while the audio-branch MMDiT simultaneously leverages audio references to guide sound generation. On the video side, we adopt a channel-concatenation formulation that unifies a wide range of inpainting…
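The abstract is cut off before it describes the channel-concatenation formulation in detail, so the following is only a generic sketch of how such conditioning is commonly built, not the paper's implementation. It assumes latents laid out as (channels, frames, height, width) and a binary mask where 1 marks regions to generate and 0 marks regions to keep. The function names, the tensor layout, and the mask convention are all hypothetical.

```python
import numpy as np


def keep_first_frames_mask(frames, height, width, keep):
    """Hypothetical helper: mask is 0 for the first `keep` frames (kept), 1 elsewhere (generated)."""
    mask = np.ones((1, frames, height, width), dtype=np.float32)
    mask[:, :keep] = 0.0
    return mask


def build_inpainting_input(noisy_latent, cond_latent, mask):
    """Concatenate [noisy latent, masked conditioning latent, mask] along the channel axis.

    Assumed shapes (illustrative, not from the paper):
      noisy_latent, cond_latent: (C, T, H, W)
      mask:                      (1, T, H, W), 1 = generate, 0 = keep
    Returns an array of shape (2 * C + 1, T, H, W).
    """
    if noisy_latent.shape != cond_latent.shape:
        raise ValueError("noisy_latent and cond_latent must have the same shape")
    if mask.shape != (1,) + noisy_latent.shape[1:]:
        raise ValueError("mask must have shape (1, T, H, W)")
    # Zero out the conditioning content wherever the model is asked to generate.
    known = cond_latent * (1.0 - mask)
    return np.concatenate([noisy_latent, known, mask], axis=0)


if __name__ == "__main__":
    C, T, H, W = 4, 5, 2, 3
    noisy = np.random.randn(C, T, H, W).astype(np.float32)
    reference = np.random.randn(C, T, H, W).astype(np.float32)
    # Keeping only the first frame resembles an image-to-video style condition.
    model_input = build_inpainting_input(noisy, reference, keep_first_frames_mask(T, H, W, keep=1))
    print(model_input.shape)
```

Under this assumed convention, different conditioning patterns differ only in the mask passed in, which is why a single concatenated input can serve as one interface for several tasks.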
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing · Advanced Image Processing Techniques
