Seedance 1.0: Exploring the Boundaries of Video Generation Models

Yu Gao; Haoyuan Guo; Tuyen Hoang; Weilin Huang; Lu Jiang; Fangyuan Kong; Huixia Li; Jiashi Li; Liang Li; Xiaojie Li; Xunsong Li; Yifu Li; Shanchuan Lin; Zhijie Lin; Jiawei Liu; Shu Liu; Xiaonan Nie; Zhiwu Qing; Yuxi Ren; Li Sun; Zhi Tian; Rui Wang; Sen Wang; Guoqiang Wei; Guohong Wu; Jie Wu; Ruiqi Xia; Fei Xiao; Xuefeng Xiao; Jiangqiao Yan; Ceyuan Yang; Jianchao Yang; Runkai Yang; Tao Yang; Yihang Yang; Zilyu Ye; Xuejiao Zeng; Yan Zeng; Heng Zhang; Yang Zhao; Xiaozheng Zheng; Peihao Zhu; Jiaxin Zou; Feilong Zuo

arXiv:2506.09113 · cs.CV · July 1, 2025

TL;DR

Seedance 1.0 is a high-performance, efficient video generation model that advances the field by integrating diverse data, innovative architecture, and optimized training to produce high-quality, coherent videos rapidly.

Contribution

The report introduces an architecture and training paradigm that natively supports multi-shot generation and jointly learns text-to-video and image-to-video tasks, with significant speed and quality improvements.

Findings

01

Achieves ~10x inference speedup through distillation and system optimization.

02

Generates 5-second 1080p videos in 41.4 seconds with high quality.

03

Outperforms state-of-the-art models in spatiotemporal fluidity and coherence.
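
To put the two speed figures above in perspective, the following is a minimal arithmetic sketch rather than anything taken from the report. It uses only the quoted numbers (41.4 seconds for a 5-second clip, and a roughly 10x speedup). The function names are hypothetical, and the "implied baseline" assumes the ~10x factor applies to this exact 5-second 1080p configuration, which the summary does not state.

```python
# Figures quoted in the findings above.
VIDEO_SECONDS = 5.0         # length of the generated clip
GENERATION_SECONDS = 41.4   # reported generation time for that clip
SPEEDUP = 10.0              # "~10x", treated as exactly 10 for illustration


def seconds_per_video_second(generation_s: float, video_s: float) -> float:
    """Compute time spent per second of generated video."""
    return generation_s / video_s


def implied_baseline_seconds(generation_s: float, speedup: float) -> float:
    """Hypothetical pre-acceleration time, ASSUMING the speedup factor
    applies to the same clip length and resolution (not confirmed)."""
    return generation_s * speedup


ratio = seconds_per_video_second(GENERATION_SECONDS, VIDEO_SECONDS)
baseline = implied_baseline_seconds(GENERATION_SECONDS, SPEEDUP)

print(f"{ratio:.2f} s of generation per second of video")
print(f"about {baseline:.0f} s implied without acceleration (assumption)")
```

Under these assumptions, generation takes about 8.28 seconds per second of output, so it is well short of real time even after acceleration.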

Abstract

Notable breakthroughs in diffusion modeling have propelled rapid improvements in video generation, yet current foundation models still face critical challenges in simultaneously balancing prompt following, motion plausibility, and visual quality. In this report, we introduce Seedance 1.0, a high-performance and inference-efficient video foundation generation model that integrates several core technical improvements: (i) multi-source data curation augmented with precise and meaningful video captioning, enabling comprehensive learning across diverse scenarios; (ii) an efficient architecture design with a proposed training paradigm, which natively supports multi-shot generation and joint learning of both text-to-video and image-to-video tasks; (iii) carefully optimized post-training approaches leveraging fine-grained supervised fine-tuning, and video-specific RLHF with…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis