SkyReels-V3 Technique Report
Debang Li, Zhengcong Fei, Tuanhui Li, Yikun Dou, Zheng Chen, Jiangping Yang, Mingyuan Fan, Jingtao Xu, Jiahua Wang, Baoxuan Gu, Mingshan Chang, Wenjing Cai, Yuqiang Xie, Binjie Mao, Youqiang Zhang, Nuo Pang, Hao Zhang, Yuzhe Jin, Zhiheng Xu, Dixuan Lin, Guibin Chen, Yahui Zhou

TL;DR
SkyReels-V3 introduces a versatile multimodal video generation framework supporting reference-based, extension, and audio-guided synthesis, with advanced training strategies to enhance quality, coherence, and robustness across diverse scenarios.
Contribution
The paper presents SkyReels-V3, a unified diffusion Transformer-based model that supports multiple video generation paradigms within a single architecture, improving fidelity, coherence, and multimodal integration.
Findings
Achieves state-of-the-art performance on key metrics.
Supports high-fidelity, coherent, and diverse video generation.
Demonstrates robustness across various scenarios.
Abstract
Video generation serves as a cornerstone for building world models, where multimodal contextual inference stands as the defining test of capability. To this end, we present SkyReels-V3, a conditional video generation model built upon a unified multimodal in-context learning framework with diffusion Transformers. The SkyReels-V3 model supports three core generative paradigms within a single architecture: reference images-to-video synthesis, video-to-video extension, and audio-guided video generation. (i) The reference images-to-video model is designed to produce high-fidelity videos with strong subject identity preservation, temporal coherence, and narrative consistency. To enhance reference adherence and compositional stability, we design a comprehensive data processing pipeline that leverages cross-frame pairing, image editing, and semantic rewriting, effectively mitigating copy paste…
Code & Models
- 🤗 Skywork/SkyReels-V3-A2V-19B · model · 1.3k dl · ♡ 85
- 🤗 Skywork/SkyReels-V3-V2V-14B · model · ♡ 14
- 🤗 Skywork/SkyReels-V3-R2V-14B · model · 338 dl · ♡ 40
- 🤗 vantagewithai/SkyReels-V3-14B-GGUF · model · 4.6k dl · ♡ 8
- 🤗 akkierocks007/SkyReels-V3-R2V-14B · model · 1 dl
- 🤗 qqceqqq/SkyReels-V3-V2V-14B · model
- 🤗 qqceqqq/SkyReels-V3-R2V-14B · model · 1 dl
- 🤗 qqceqqq/SkyReels-V3-A2V-19B · model
- 🤗 Frederic75/SkyReels-V3-14B-GGUF · model · 208 dl · ♡ 1
- 🤗 zhangxiang1209/SkyReels-V3-A2V-19B · model · 16 dl
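The three Skywork repositories listed above appear to correspond to the three paradigms named in the abstract. The sketch below shows one way to fetch a checkpoint with the `huggingface_hub` library's `snapshot_download` function. The task keys (`r2v`, `v2v`, `a2v`), their mapping to paradigms (inferred only from the repository names), and the helper names `resolve_repo` and `download_checkpoint` are assumptions made for illustration, not part of the SkyReels-V3 release.

```python
from typing import Optional

# Repo IDs are copied from the list above. The task keys and the paradigm
# each one stands for are assumptions inferred from the repository names.
OFFICIAL_REPOS = {
    "r2v": "Skywork/SkyReels-V3-R2V-14B",  # assumed: reference images-to-video
    "v2v": "Skywork/SkyReels-V3-V2V-14B",  # assumed: video-to-video extension
    "a2v": "Skywork/SkyReels-V3-A2V-19B",  # assumed: audio-guided video generation
}


def resolve_repo(task: str) -> str:
    """Return the Hugging Face repo ID for a (hypothetical) task key."""
    key = task.strip().lower()
    if key not in OFFICIAL_REPOS:
        valid = ", ".join(sorted(OFFICIAL_REPOS))
        raise ValueError(f"unknown task {task!r}; expected one of: {valid}")
    return OFFICIAL_REPOS[key]


def download_checkpoint(task: str, local_dir: Optional[str] = None) -> str:
    """Download every file in the repo for `task` and return the local path.

    The import is deferred so that resolve_repo() works even when
    huggingface_hub is not installed.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=resolve_repo(task), local_dir=local_dir)
```

Calling `download_checkpoint("r2v")` would pull the full repository into the local Hugging Face cache. These checkpoints are multi-billion-parameter models, so the download is large. How to load the weights for inference is not covered on this page.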
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis
