TL;DR
MotionRFT introduces a reinforcement fine-tuning framework with a unified semantic reward and efficient step-wise optimization, significantly improving text-to-motion generation quality and efficiency.
Contribution
The paper proposes a multi-dimensional reward model, MotionReward, and a fine-grained, memory-efficient fine-tuning method, EasyTune, for better alignment of text-to-motion models.
Findings
Achieved an FID of 0.132 using 22.10 GB of memory on the MLD model.
Saved up to 15.22 GB of memory compared to DRaFT.
Improved FID and R-Precision metrics on multiple motion datasets.
Abstract
Text-to-motion generation has advanced with diffusion- and flow-based generative models, yet supervised pretraining remains insufficient to align models with high-level objectives such as semantic consistency, realism, and human preference. Existing post-training methods have key limitations: they (1) target a specific motion representation, such as joints; (2) optimize a particular aspect, such as text-motion alignment, and may compromise other factors; and (3) incur substantial computational overhead, data dependence, and coarse-grained optimization. We present a reinforcement fine-tuning framework that comprises a heterogeneous-representation, multi-dimensional reward model, MotionReward, and an efficient, fine-grained fine-tuning method, EasyTune. To obtain a unified semantic representation, MotionReward maps heterogeneous motions into a shared semantic space anchored by text,…
