Back to Basics: Motion Representation Matters for Human Motion Generation Using Diffusion Model
Yuduo Jin, Brandon Haworth

TL;DR
This paper systematically investigates how different motion representations and training configurations affect the performance and efficiency of human motion diffusion models, providing insights for better model design.
Contribution
It offers a comprehensive empirical analysis of motion representations, training time, and configurations in diffusion-based human motion synthesis, which was previously underexplored.
Findings
Different motion representations significantly affect the quality and diversity of generated motion.
Training configurations influence model efficiency and outcomes.
Empirical results highlight the importance of design choices in motion diffusion models.
Abstract
Diffusion models have emerged as a widely utilized and successful methodology in human motion synthesis. Task-oriented diffusion models have significantly advanced action-to-motion, text-to-motion, and audio-to-motion applications. In this paper, we investigate fundamental questions regarding motion representations and loss functions in a controlled study, and we enumerate the impacts of various decisions in the workflow of the generative motion diffusion model. To answer these questions, we conduct empirical studies based on a proxy motion diffusion model (MDM). We apply v loss as the prediction objective on MDM (vMDM), where v is the weighted sum of motion data and noise. We aim to enhance the understanding of latent data distributions and provide a foundation for improving the state of conditional motion diffusion models. First, we evaluate the six common motion representations in…
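The abstract describes v as a weighted sum of motion data and noise. A minimal sketch of one common form of this objective follows, assuming the standard v-parameterization with a variance-preserving schedule (alpha^2 + sigma^2 = 1). The exact weights and schedule used in vMDM are not stated in this summary, so the cosine schedule, the function names, and the clip shape below are illustrative assumptions only.

```python
import numpy as np

def cosine_alpha_sigma(t):
    """Assumed variance-preserving schedule: alpha^2 + sigma^2 = 1 for t in [0, 1]."""
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def noisy_sample(x0, eps, alpha, sigma):
    """Forward diffusion: mix clean motion x0 with Gaussian noise eps."""
    return alpha * x0 + sigma * eps

def v_target(x0, eps, alpha, sigma):
    """v-prediction target: a weighted combination of noise and clean data."""
    return alpha * eps - sigma * x0

def x0_from_v(x_t, v, alpha, sigma):
    """Recover the clean motion from the noisy sample and v."""
    return alpha * x_t - sigma * v

rng = np.random.default_rng(0)
# Hypothetical joint-position clip: 60 frames, 22 joints, xyz coordinates.
x0 = rng.standard_normal((60, 22, 3))
eps = rng.standard_normal(x0.shape)

alpha, sigma = cosine_alpha_sigma(0.3)
x_t = noisy_sample(x0, eps, alpha, sigma)
v = v_target(x0, eps, alpha, sigma)

# With the true v, the clean motion is recovered exactly (up to float error).
x0_rec = x0_from_v(x_t, v, alpha, sigma)
```

In training, a network would be asked to predict `v` from `x_t` and the timestep, typically under a mean-squared-error loss. The identity used in `x0_from_v` holds only when alpha^2 + sigma^2 = 1.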
Peer Reviews
Decision · Submitted to ICLR 2026
1. Clean ablation of six motion representations within one diffusion backbone (MDM/vMDM) with standard metrics (FID/KID/precision/recall/diversity).
2. Clear empirical takeaway in their setup: under vMDM, JP outperforms rotation-based reps and is faster to train.
3. Practical notes on training/inference that practitioners can immediately try.
1. The general question “which motion representation is easier to learn for diffusion models” is not new. For example, MARDM [1] discusses redundant motion representations for training VQ-based vs. diffusion-based models. MotionStreamer [2] proposes a 272-D motion representation to remove post-processing that is required for animation. InterGen [3] introduces a representation tailored for two-person interactions. ACMDM [4] shows that absolute/global joint coordinates improve motion fidelity and
1. Modern advancements in the motion generation field are mostly at the architectural level; however, the representation itself, like the diffusion objective, is a fundamental problem. As a researcher in this field, I agree with the authors' motivation for the paper.
2. Various comparisons across six representations are conducted on a controlled architecture, and the results follow intuition and motivation.
1. I admire the motivation and respect the authors’ effort to explore representation-level questions in motion generation. However, as a researcher in this field, **I sincerely believe it is also important to acknowledge prior work that has already addressed many of these issues**. Several earlier papers have studied these choices and reported similar or stronger findings before this submission. MLD, MotionLCM, and MotionStreamer have shown that latent spaces are a much better representation to m
1. The manuscript tackles the impact of motion representation on diffusion-based motion generation, a meaningful question that can inform and inspire subsequent research.
2. The manuscript provides thorough quantitative and qualitative analyses across multiple motion representations, offering clear empirical evidence to support the study’s conclusions.
1. **Limited methodological breadth**. Experiments are confined to MDM, without evaluating other motion-generation methods (e.g., VAE- or autoregressive-based models, or more recent architectures). This narrow scope limits the generality and external validity of the conclusions.
2. **No study of combined representations**. Each representation is trained and tested in isolation. As noted by the authors, prior work and datasets (e.g., HumanML3D [1]) often combine permutations of joint positions an
Taxonomy
Topics: Human Motion and Animation · Ergonomics and Musculoskeletal Disorders · Social Robot Interaction and HRI
