VDOT: Efficient Unified Video Creation via Optimal Transport Distillation
Yutong Wang, Haiyu Zhang, Tianfan Xue, Yu Qiao, Yaohui Wang, Chang Xu, Xinyuan Chen

TL;DR
VDOT introduces an efficient, unified video creation model using optimal transport distillation, significantly reducing generation time while maintaining high quality, and providing a standardized benchmark for evaluation.
Contribution
The paper proposes a novel optimal transport-based distillation method for unified video creation, improving efficiency and stability over traditional KL-based approaches.
Findings
VDOT outperforms baselines with fewer steps
Achieves quality comparable to 100-step generation using only 4 steps
Provides a new benchmark for unified video creation evaluation
Abstract
The rapid development of generative models has significantly advanced image and video applications. Among these, video creation, aimed at generating videos under various conditions, has gained substantial attention. However, existing video creation models either focus solely on a few specific conditions or suffer from excessively long generation times due to complex model inference, making them impractical for real-world applications. To mitigate these issues, we propose an efficient unified video creation model, named VDOT. Concretely, we model the training process with the distribution matching distillation (DMD) paradigm. Instead of using Kullback-Leibler (KL) minimization, we additionally employ a novel computational optimal transport (OT) technique to optimize the discrepancy between the real and fake score distributions. The OT distance inherently imposes geometric…
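The abstract does not say which OT solver VDOT uses, so the following is only a generic illustration of what an OT discrepancy between two sample sets computes. It is a minimal sketch that assumes entropy-regularized OT solved with Sinkhorn iterations, a standard technique substituted here for illustration and not the paper's method. The function name `sinkhorn_cost` and the parameters `eps` and `n_iters` are hypothetical.

```python
import numpy as np

def sinkhorn_cost(x, y, eps=1.0, n_iters=200):
    """Entropy-regularized OT between two point clouds with uniform weights.

    Illustrative only: plain (non-log-domain) Sinkhorn, which can underflow
    when pairwise costs are very large relative to eps.
    Returns (transport cost, transport plan).
    """
    n, m = x.shape[0], y.shape[0]
    # Pairwise squared Euclidean costs, shape (n, m).
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    a = np.full(n, 1.0 / n)   # uniform source weights
    b = np.full(m, 1.0 / m)   # uniform target weights
    K = np.exp(-cost / eps)   # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)       # rescale rows toward marginal a
        v = b / (K.T @ u)     # rescale columns toward marginal b
    plan = u[:, None] * K * v[None, :]
    return float((plan * cost).sum()), plan

# Toy usage: a lightly shifted cloud should be cheaper to reach than a far one.
rng = np.random.default_rng(0)
samples = rng.normal(size=(8, 2))
d_near, _ = sinkhorn_cost(samples, samples + 0.01)
d_far, _ = sinkhorn_cost(samples, samples + 3.0)
print(d_near, d_far)
```

Unlike a KL term, this cost depends on how far samples must move, which is the geometric sensitivity the abstract alludes to. How VDOT applies OT to score distributions is not recoverable from the truncated text above.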
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques · Visual Attention and Saliency Detection
