EasyTune: Efficient Step-Aware Fine-Tuning for Diffusion-Based Motion Generation
Xiaofeng Tan, Wanjiang Weng, Haodong Lei, and Hongsong Wang

TL;DR
EasyTune is a step-aware fine-tuning method for diffusion-based motion generation. It improves alignment efficiency and reduces memory usage by decoupling the recursive denoising steps and by using self-refinement preference learning.
Contribution
The paper proposes EasyTune, a novel fine-tuning approach that decouples denoising steps and incorporates self-refinement for preference learning, enhancing efficiency and performance.
Findings
Outperforms DRaFT-50 by 8.2% in alignment.
Requires only 31.16% of DRaFT-50's memory overhead.
Achieves 7.3x faster training speed.
Abstract
In recent years, motion generative models have advanced significantly, yet aligning them with downstream objectives remains challenging. Recent studies have shown that using differentiable rewards to directly align the preference of diffusion models yields promising results. However, these methods suffer from (1) inefficient and coarse-grained optimization and (2) high memory consumption. In this work, we first theoretically and empirically identify the key reason for these limitations: the recursive dependence between different steps in the denoising trajectory. Inspired by this insight, we propose EasyTune, which fine-tunes diffusion at each denoising step rather than over the entire trajectory. This decouples the recursive dependence, allowing us to perform optimization that is (1) dense and fine-grained and (2) memory-efficient. Furthermore, the scarcity of preference motion pairs…
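The contrast the abstract draws between trajectory-level and step-level optimization can be illustrated with a toy scalar sketch. Everything here is an assumption made for illustration and is not taken from the paper: the denoiser `denoise_step`, the reward `R(x) = -x**2`, and applying the reward directly to each intermediate state. The stop-gradient is emulated by treating the step input as a constant in the hand-derived derivative. The sketch shows only the gradient dependence, not the memory savings.

```python
# Toy scalar illustration (hypothetical; not the authors' implementation).
# Assumed denoiser step: x_{t-1} = (1 - theta) * x_t
# Assumed reward:        R(x) = -x**2


def denoise_step(x, theta):
    """One assumed denoising step with a single scalar parameter theta."""
    return (1.0 - theta) * x


def reward_grad(x):
    """dR/dx for the assumed reward R(x) = -x**2."""
    return -2.0 * x


def chain_gradient(x_T, theta, num_steps):
    """Gradient of R(x_0) w.r.t. theta through the whole trajectory.

    Each state depends on theta both directly and through the previous
    state, so the derivative is accumulated recursively across steps.
    """
    x, dx_dtheta = x_T, 0.0
    for _ in range(num_steps):
        # Total derivative: df/dtheta + (df/dx) * dx/dtheta
        dx_dtheta = -x + (1.0 - theta) * dx_dtheta
        x = denoise_step(x, theta)
    return reward_grad(x) * dx_dtheta


def stepwise_gradients(x_T, theta, num_steps):
    """One local gradient per step; the step input is held constant."""
    x, grads = x_T, []
    for _ in range(num_steps):
        x_next = denoise_step(x, theta)
        # Input x is treated as a constant (stop-gradient), so only the
        # direct dependence df/dtheta = -x remains.
        grads.append(reward_grad(x_next) * (-x))
        x = x_next
    return grads


print(chain_gradient(1.0, 0.5, 3))      # 0.1875
print(stepwise_gradients(1.0, 0.5, 3))  # [1.0, 0.25, 0.0625]
```

In this toy setting the trajectory-level objective yields a single gradient after all steps, while the step-level objective yields one gradient per step, each computable from the current state alone.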
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper clearly explains the main problem in existing diffusion fine-tuning methods and gives a straightforward way to make the optimization more efficient.
2. The experiments are thorough and show steady improvements on several benchmarks and models.
3. The method can be applied in practice since it does not rely on human-labeled data for reward learning.
4. The writing and presentation are clear and easy to follow.
5. The video results presented in the supplementary materials look dec
### Major Concerns
1. In the proposed Self-refinement Preference Learning (SPL), a pre-trained text-to-motion retrieval model is used for preference evaluation. It would be important to clarify how critical the choice of this model is. For example, if a weaker retrieval model were used, would the overall performance of EasyTune be significantly affected? Some discussion or analysis on this point would strengthen the argument, especially since the retrieval model is further fine-tuned.
2. While
- The paper addresses an important and timely problem in diffusion model fine-tuning, providing a thorough analysis of the limitations in existing differentiable reward methods.
- The proposed EasyTune framework is conceptually clear and mathematically well-formulated, offering a principled solution to recursive gradient and memory inefficiency issues.
- The authors conduct extensive experiments and comparisons across multiple datasets and backbone models, demonstrating consistent and substantia
The paper’s presentation could be improved: some sections are densely formatted and would benefit from clearer visual structure (e.g., spacing, figure placement, and paragraph organization) to improve readability.
- The paper cleanly shows how chain‑rule recursion creates sparse/vanishing gradients and large memory graphs and contrasts chain vs. step optimization. The motivation is clear and reasonable.
- The proposed method seems to be a simple, general, and effective idea: The per‑step objective with `sg(·)` is easy to implement on standard motion diffusers.
- Results span two datasets and several backbones; on HumanML3D, EasyTune improves R‑Precision/FID/MM‑Dist vs. DRaFT/AlignProp/DRTune while using l
I am a bit worried about the technical novelty, though it seems acceptable: obtaining the fine-tuning input x_t by denoising sampling appears to be new. However, Theorem 1 is a direct chain‑rule decomposition (can it be called a theorem?), and Theorem 2 is the local derivative after inserting `sg(·)`. Both are too straightforward to be called theorems... Besides, the SPL improvements are modest... On HumanML3D retrieval, SPL improves ReAlign by R@1 +2.5% / R@3 +1.4%; on KIT‑ML, R@5 +2.2%. In the Limitation section, th
Taxonomy
Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis
