SkyReels-Audio: Omni Audio-Conditioned Talking Portraits in Video Diffusion Transformers

Zhengcong Fei; Hao Jiang; Di Qiu; Baoxuan Gu; Youqiang Zhang; Jiahua Wang; Jialin Bai; Debang Li; Mingyuan Fan; Guibin Chen; Yahui Zhou

arXiv:2506.00830 · cs.CV · June 3, 2025

TL;DR

SkyReels-Audio is a novel framework that synthesizes high-quality, temporally coherent talking portrait videos conditioned on multimodal inputs, enabling flexible editing and long-duration generation with improved lip-sync and facial consistency.

Contribution

It introduces a unified multimodal diffusion-based framework with a hybrid curriculum learning strategy and new loss functions for high-fidelity, controllable, and long-duration talking portrait video synthesis.

Findings

1. Achieves superior lip-sync accuracy and identity consistency
2. Ensures visual fidelity and temporal coherence in extended videos
3. Performs well under complex and challenging conditions

Abstract

The generation and editing of audio-conditioned talking portraits guided by multimodal inputs, including text, images, and videos, remains underexplored. In this paper, we present SkyReels-Audio, a unified framework for synthesizing high-fidelity and temporally coherent talking portrait videos. Built upon pretrained video diffusion transformers, our framework supports infinite-length generation and editing, while enabling diverse and controllable conditioning through multimodal inputs. We employ a hybrid curriculum learning strategy to progressively align audio with facial motion, enabling fine-grained multimodal control over long video sequences. To enhance local facial coherence, we introduce a facial mask loss and an audio-guided classifier-free guidance mechanism. A sliding-window denoising approach further fuses latent representations across temporal segments, ensuring visual…
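Two mechanisms named in the abstract, audio-guided classifier-free guidance and sliding-window fusion of latents, can be illustrated with a minimal sketch. The excerpt does not give the paper's exact formulation, so everything here is an assumption: `audio_guided_cfg` uses the standard classifier-free guidance combination applied to the audio condition only, `fuse_windows` uses plain uniform averaging over overlapping frames, and latents are reduced to one scalar per frame for readability. The function names, the `scale` and `stride` parameters, and the averaging rule are hypothetical and are not taken from the SkyReels-Audio code.

```python
from typing import List, Sequence


def audio_guided_cfg(eps_no_audio: Sequence[float],
                     eps_audio: Sequence[float],
                     scale: float) -> List[float]:
    """Generic classifier-free guidance on the audio condition (assumed form).

    Moves the prediction made without audio toward the audio-conditioned
    prediction; scale=1.0 returns the conditioned prediction unchanged.
    """
    return [u + scale * (c - u) for u, c in zip(eps_no_audio, eps_audio)]


def fuse_windows(windows: Sequence[Sequence[float]],
                 stride: int,
                 total_len: int) -> List[float]:
    """Average overlapping per-window latents into one long sequence.

    Window i is assumed to start at frame i * stride. Frames covered by
    several windows receive the mean of all contributions (an assumed
    blending rule; the paper may weight the overlap differently).
    """
    acc = [0.0] * total_len      # running sum of latent values per frame
    weight = [0.0] * total_len   # number of windows covering each frame
    for w_idx, window in enumerate(windows):
        start = w_idx * stride
        for offset, value in enumerate(window):
            t = start + offset
            if t >= total_len:
                break
            acc[t] += value
            weight[t] += 1.0
    if any(w == 0.0 for w in weight):
        raise ValueError("windows do not cover every frame")
    return [a / w for a, w in zip(acc, weight)]


if __name__ == "__main__":
    # Two 4-frame windows with stride 2 overlap on frames 2 and 3.
    print(fuse_windows([[1.0] * 4, [3.0] * 4], stride=2, total_len=6))
    print(audio_guided_cfg([0.0, 1.0], [1.0, 1.0], scale=2.0))
```

In a real sampler the fusion would run at every denoising step on full latent tensors, so that neighbouring segments agree on their shared frames before the next step; the scalar version above only shows the bookkeeping.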

Taxonomy

Topics: Radio, Podcasts, and Digital Media · Multimedia Communication and Technology · Subtitles and Audiovisual Media

Methods: Diffusion · ALIGN