SkyReels-Audio: Omni Audio-Conditioned Talking Portraits in Video Diffusion Transformers
Zhengcong Fei, Hao Jiang, Di Qiu, Baoxuan Gu, Youqiang Zhang, Jiahua Wang, Jialin Bai, Debang Li, Mingyuan Fan, Guibin Chen, Yahui Zhou

TL;DR
SkyReels-Audio is a novel framework that synthesizes high-quality, temporally coherent talking portrait videos conditioned on multimodal inputs, enabling flexible editing and long-duration generation with improved lip-sync and facial consistency.
Contribution
It introduces a unified multimodal diffusion-based framework with a hybrid curriculum learning strategy and new loss functions for high-fidelity, controllable, and long-duration talking portrait video synthesis.
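The abstract names a facial mask loss as one of these loss functions but does not give its form. As a rough illustration only, one common way to emphasize a face region is to weight the reconstruction error more heavily inside a binary mask. The function name `masked_mse`, the `face_weight` parameter, and the use of flat lists of floats in place of image or latent tensors are all assumptions made for this sketch, not details from the paper.

```python
def masked_mse(pred, target, face_mask, face_weight=2.0):
    """Weighted mean squared error that emphasizes the face region.

    pred, target: flat lists of floats (hypothetical stand-ins for tensors).
    face_mask:    flat list of 0/1 values, 1 marking face pixels.
    face_weight:  assumed multiplier applied to errors inside the mask.
    """
    if not (len(pred) == len(target) == len(face_mask)):
        raise ValueError("pred, target, and face_mask must have equal length")
    if not pred:
        raise ValueError("inputs must be non-empty")

    weighted_sum = 0.0
    weight_total = 0.0
    for p, t, m in zip(pred, target, face_mask):
        # Weight is 1 outside the face and face_weight inside it.
        w = 1.0 + (face_weight - 1.0) * m
        weighted_sum += w * (p - t) ** 2
        weight_total += w
    # Normalize by the total weight so the loss stays on the MSE scale.
    return weighted_sum / weight_total
```

With `face_weight=3.0`, the same unit error counts three times as much when it falls inside the mask as when it falls outside, which is the intended effect of focusing the loss on local facial detail.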
Findings
Achieves superior lip-sync accuracy and identity consistency
Ensures visual fidelity and temporal coherence in extended videos
Performs well under complex and challenging conditions
Abstract
The generation and editing of audio-conditioned talking portraits guided by multimodal inputs, including text, images, and videos, remains underexplored. In this paper, we present SkyReels-Audio, a unified framework for synthesizing high-fidelity and temporally coherent talking portrait videos. Built upon pretrained video diffusion transformers, our framework supports infinite-length generation and editing, while enabling diverse and controllable conditioning through multimodal inputs. We employ a hybrid curriculum learning strategy to progressively align audio with facial motion, enabling fine-grained multimodal control over long video sequences. To enhance local facial coherence, we introduce a facial mask loss and an audio-guided classifier-free guidance mechanism. A sliding-window denoising approach further fuses latent representations across temporal segments, ensuring visual…
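The sliding-window idea mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not say how latents are fused, so uniform averaging over overlapping frames is assumed here, each frame's latent is reduced to a single float, and the names `window_starts` and `fuse_windows` are hypothetical.

```python
def window_starts(total, window, stride):
    """Start indices of overlapping windows that cover frames 0..total-1."""
    if not (0 < stride <= window <= total):
        raise ValueError("require 0 < stride <= window <= total")
    starts = list(range(0, total - window + 1, stride))
    # Add a final window flush with the end if the stride left a gap.
    if starts[-1] + window < total:
        starts.append(total - window)
    return starts


def fuse_windows(window_latents, starts, total):
    """Fuse per-window latents by averaging wherever windows overlap.

    window_latents: one list of per-frame floats for each window
                    (a stand-in for real latent tensors).
    starts:         start frame of each window, as from window_starts.
    total:          total number of frames in the long sequence.
    """
    sums = [0.0] * total
    counts = [0] * total
    for start, latents in zip(starts, window_latents):
        for offset, value in enumerate(latents):
            sums[start + offset] += value
            counts[start + offset] += 1
    if any(c == 0 for c in counts):
        raise ValueError("some frames are not covered by any window")
    # Assumed fusion rule: uniform mean over all windows covering a frame.
    return [s / c for s, c in zip(sums, counts)]
```

In a diffusion setting, a fusion step of this kind would typically run at each denoising step so that neighboring windows agree on their shared frames. A weighted blend that favors window centers is a common alternative to the uniform mean assumed here.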
Code & Models
- 🤗 Skywork/SkyReels-V3-A2V-19B · model · 1.3k dl · ♡ 85
- 🤗 Skywork/SkyReels-V3-V2V-14B · model · ♡ 14
- 🤗 Skywork/SkyReels-V3-R2V-14B · model · 338 dl · ♡ 40
- 🤗 vantagewithai/SkyReels-V3-14B-GGUF · model · 4.6k dl · ♡ 8
- 🤗 Notweare/Skyreels-v2 · model
- 🤗 akkierocks007/SkyReels-V3-R2V-14B · model · 1 dl
- 🤗 qqceqqq/SkyReels-V3-V2V-14B · model
- 🤗 qqceqqq/SkyReels-V3-R2V-14B · model · 1 dl
- 🤗 qqceqqq/SkyReels-V3-A2V-19B · model
- 🤗 Frederic75/SkyReels-V3-14B-GGUF · model · 208 dl · ♡ 1
Taxonomy
Topics: Radio, Podcasts, and Digital Media · Multimedia Communication and Technology · Subtitles and Audiovisual Media
Methods: Diffusion · ALIGN
