Efficient Training for Human Video Generation with Entropy-Guided Prioritized Progressive Learning
Changlin Li, Jiawei Zhang, Shuhao Liu, Sihao Lin, Zeyi Shi, Zhihui Li, Xiaojun Chang

TL;DR
This paper presents Ent-Prog, an efficient training framework for human video diffusion models that reduces training time and memory usage by prioritizing critical components and adaptively increasing complexity, without sacrificing quality.
Contribution
It introduces Conditional Entropy Inflation and an adaptive progressive schedule to optimize training efficiency for human video generation models.
Findings
Achieves up to 2.2× training speedup
Reduces GPU memory by 2.4×
Maintains high-quality video generation
Abstract
Human video generation has advanced rapidly with the development of diffusion models, but the high computational cost and substantial memory consumption associated with training these models on high-resolution, multi-frame data pose significant challenges. In this paper, we propose Entropy-Guided Prioritized Progressive Learning (Ent-Prog), an efficient training framework tailored for diffusion models on human video generation. First, we introduce Conditional Entropy Inflation (CEI) to assess the importance of different model components on the target conditional generation task, enabling prioritized training of the most critical components. Second, we introduce an adaptive progressive schedule that increases computational complexity during training according to the measured convergence efficiency. Ent-Prog reduces both training time and GPU memory consumption while maintaining model…
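The prioritized, progressive part of the abstract can be illustrated with a minimal sketch. It assumes per-block importance scores are already available (the values below are made up, not CEI values from the paper) and uses a fixed list of block budgets per stage, whereas the paper's schedule is described as adaptive. The function names `select_trainable_blocks` and `progressive_plan` are hypothetical.

```python
def select_trainable_blocks(scores, k):
    """Return the indices of the k highest-scoring blocks, in ascending order.

    Ties are broken in favor of the lower block index.
    """
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])


def progressive_plan(scores, stages):
    """For each stage's block budget, list which blocks would be unfrozen."""
    return [select_trainable_blocks(scores, k) for k in stages]


# Hypothetical importance scores for a four-block model.
scores = [0.1, 0.9, 0.3, 0.7]

# Assumed fixed budgets: 1 block, then 2, then all 4.
plan = progressive_plan(scores, stages=[1, 2, 4])
for stage, blocks in enumerate(plan):
    print(f"stage {stage}: train blocks {blocks}")
```

In a real training loop the selected indices would decide which blocks receive gradients, and the remaining blocks would stay frozen, which is where the time and memory savings would come from.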
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
- The idea of estimating block importance and progressively unfreezing from high-priority blocks is intuitive and straightforward, and its ease of being retrofitted to existing models makes it practically useful. - This adaptive scheduling reduces trial-and-error, balances quality and compute under a fixed budget, and is validated by experiments, highlighting its practical utility.
- While the paper focuses on human video generation, the proposed approach, estimating block importance and progressively unfreezing high-priority components, appears domain-agnostic. The authors should clarify what adaptations, if any, are specific to the human-video setting (e.g., choices tied to identity preservation or pose adherence). Alternatively, if the method is intended to be general, its effectiveness should be validated on additional video generation tasks beyond humans. - The paper
- Leveraging CEI for PPL demonstrates a certain level of insight and novelty. - The experimental results show a significant improvement, balancing both quality and training efficiency. - The paper is well-organized and easy to read.
I am concerned that there may be a critical factual error in the paper. The authors claim in lines 206-207 that `Since diffusion models are trained to predict Gaussian noise, we expect the distribution of ε to be approximately Gaussian`, which is the key to applying CEI in practice. However, the epsilon-prediction in diffusion models does not follow a Gaussian distribution. On the contrary, it is highly related to the data distribution [1]. Given this, how can the authors use Eq. (2) to approxima
1. The proposed method delivers impressive practical benefits, achieving up to a 2.2× training speedup and a 2.4× GPU memory reduction.
2. CEI is a new metric for assessing the importance of blocks.
3. The adaptive progressive schedule, which uses a Nested Diffusion Supernet to dynamically select the number of blocks to unfreeze based on convergence efficiency, is well motivated and outperforms the variant without the schedule.
1. Missing CEI distribution analysis: the paper lacks an in-depth analysis or visualization of the CEI distribution to confirm whether only a few blocks truly have high priority scores, which is crucial.
2. Missing details of the baseline.
3. Missing comparison to LoRA-based finetuning and other related works.
4. Missing comparison of training time (hours).
5. How is the proposed method specific to human video generation? This seems to be a general training acceleration method to me.
6. How does the pr
1. The method can substantially reduce training cost and memory without quality degradation.
2. The proposed CEI metric can measure the importance of each layer block.
1. While CEI is defined as a difference of entropies, in practice it is computed statistically as a ratio of standard deviations estimated from about 1k samples. This seems inaccurate, and the samples are too few. In addition, the authors should compare the proposed CEI with other model pruning or layer scoring methods.
2. Another major concern is the experiments. There is no comparison to other competitive methods, such as Animate Anyone, Champ, StableAnimator, etc.
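The link between an entropy difference and a ratio of standard deviations, raised in the last review, follows from the closed-form entropy of a one-dimensional Gaussian, h = 0.5 · ln(2πe σ²), so that h₁ − h₂ = ln(σ₁/σ₂). The sketch below only illustrates that identity and how a 1,000-sample estimate behaves on synthetic Gaussian data. It is not the paper's Eq. (2), which is not shown here, and it holds only under the Gaussian assumption that another reviewer disputes. The function names are hypothetical.

```python
import math
import random
import statistics


def gaussian_entropy(std):
    """Differential entropy (in nats) of a 1-D Gaussian with the given std."""
    return 0.5 * math.log(2 * math.pi * math.e * std ** 2)


def entropy_gap_from_samples(a, b):
    """Estimate h(a) - h(b) as log(std_a / std_b), assuming both are Gaussian."""
    return math.log(statistics.stdev(a) / statistics.stdev(b))


# Exact identity: the entropy difference equals the log of the std ratio.
exact_gap = gaussian_entropy(2.0) - gaussian_entropy(1.0)

# Sample-based estimate from 1,000 synthetic draws per distribution.
random.seed(0)
wide = [random.gauss(0.0, 2.0) for _ in range(1000)]
narrow = [random.gauss(0.0, 1.0) for _ in range(1000)]
estimated_gap = entropy_gap_from_samples(wide, narrow)

print(f"exact: {exact_gap:.4f}  estimated: {estimated_gap:.4f}")
```

For truly Gaussian data the estimate concentrates around ln 2 at this sample size. The estimator says nothing reliable about distributions that are far from Gaussian, which is the substance of the reviewers' concern.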
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
