A Systematic Post-Train Framework for Video Generation

Zeyue Xue; Siming Fu; Jie Huang; Shuai Lu; Haoran Li; Yijun Liu; Yuming Li; Xiaoxuan He; Mengzhao Chen; Haoyang Huang; Nan Duan; Ping Luo

arXiv:2604.25427 · cs.CV · April 29, 2026

TL;DR

This paper introduces a comprehensive post-training framework for video diffusion models that enhances stability, temporal coherence, and controllability, addressing deployment challenges like prompt sensitivity and inference costs.

Contribution

It proposes a systematic pipeline combining fine-tuning, reinforcement learning, prompt refinement, and inference optimization to improve real-world video generation quality.

Findings

01. Significant improvement in visual quality and temporal coherence.

02. Enhanced controllability and instruction following.

03. Reduced sampling costs while maintaining high-quality outputs.

Abstract

Large-scale video diffusion models have demonstrated impressive capabilities in generating high-resolution and semantically rich content. However, a significant gap remains between their pretraining performance and real-world deployment requirements, due to critical issues such as prompt sensitivity, temporal inconsistency, and prohibitive inference costs. To bridge this gap, we propose a comprehensive post-training framework that systematically aligns pretrained models with user intentions through four synergistic stages. First, we employ Supervised Fine-Tuning (SFT) to transform the base model into a stable instruction-following policy. Next, a Reinforcement Learning from Human Feedback (RLHF) stage uses a novel Group Relative Policy Optimization (GRPO) method tailored for video diffusion to enhance perceptual quality and temporal coherence. Subsequently, we integrate Prompt…
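To illustrate the group-relative idea behind GRPO named in the abstract, here is a minimal sketch of the generic advantage computation: several samples are drawn for the same prompt, each is scored by a reward model, and each reward is normalized against the statistics of its own group. This is the standard GRPO formulation, not the video-specific variant the paper proposes, whose details are not given in this listing. The function name, the example reward values, the use of the population standard deviation, and the `eps` stabilizer are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Hypothetical helper: generic GRPO-style advantages, not the paper's method.
    """
    mu = mean(rewards)        # group mean acts as the baseline
    sigma = pstdev(rewards)   # population std (an assumption; some variants differ)
    # eps avoids division by zero when every sample gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward scores for four videos generated from one prompt
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Samples that score above the group mean receive positive advantages and are reinforced, while those below receive negative ones. Because the baseline comes from the group itself, no separate value network is needed.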
