SkyReels-V3 Technique Report

Debang Li; Zhengcong Fei; Tuanhui Li; Yikun Dou; Zheng Chen; Jiangping Yang; Mingyuan Fan; Jingtao Xu; Jiahua Wang; Baoxuan Gu; Mingshan Chang; Wenjing Cai; Yuqiang Xie; Binjie Mao; Youqiang Zhang; Nuo Pang; Hao Zhang; Yuzhe Jin; Zhiheng Xu; Dixuan Lin; Guibin Chen; Yahui Zhou

arXiv:2601.17323·cs.CV·January 30, 2026

TL;DR

SkyReels-V3 introduces a versatile multimodal video generation framework supporting reference-based, extension, and audio-guided synthesis, with advanced training strategies to enhance quality, coherence, and robustness across diverse scenarios.

Contribution

The paper presents SkyReels-V3, a unified diffusion Transformer-based model that supports multiple video generation paradigms within a single architecture, improving fidelity, coherence, and multimodal integration.

Findings

01. Achieves state-of-the-art performance on key metrics.

02. Supports high-fidelity, coherent, and diverse video generation.

03. Demonstrates robustness across various scenarios.

Abstract

Video generation serves as a cornerstone for building world models, where multimodal contextual inference stands as the defining test of capability. To this end, we present SkyReels-V3, a conditional video generation model built upon a unified multimodal in-context learning framework with diffusion Transformers. The SkyReels-V3 model supports three core generative paradigms within a single architecture: reference images-to-video synthesis, video-to-video extension, and audio-guided video generation. (i) The reference images-to-video model is designed to produce high-fidelity videos with strong subject identity preservation, temporal coherence, and narrative consistency. To enhance reference adherence and compositional stability, we design a comprehensive data processing pipeline that leverages cross-frame pairing, image editing, and semantic rewriting, effectively mitigating copy paste…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis