Event-T2M: Event-level Conditioning for Complex Text-to-Motion Synthesis
Seong-Eun Hong, JaeYoung Seon, JuYeong Hwang, JongHwan Shin, HyeongYeop Kang

TL;DR
This paper introduces Event-T2M, a diffusion-based framework that decomposes complex multi-action text prompts into semantically self-contained events, improving the naturalness and order of generated motions, especially for multi-event prompts.
Contribution
It defines the concept of an event in text-to-motion synthesis, proposes a novel event-based conditioning method, and introduces a new benchmark for multi-event prompts to evaluate model performance.
Findings
Event-T2M outperforms baselines on multi-event prompts.
The new HumanML3D-E benchmark effectively stratifies prompts by event complexity.
Human studies confirm the naturalness and order preservation of generated motions.
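As a rough illustration of what stratifying prompts by event complexity could look like, the sketch below buckets prompts by how many events each contains. It assumes the event lists are already available (the reviews note that an LLM produces them). The function name `stratify_by_event_count`, the `max_bucket` cap, and the sample prompts are hypothetical and not taken from the paper or from HumanML3D-E.

```python
from collections import defaultdict

def stratify_by_event_count(decomposed, max_bucket=4):
    """Group prompts by number of events.

    `decomposed` maps a prompt string to its list of event strings.
    Prompts with more than `max_bucket` events share the top bucket
    (an assumption made here for illustration only).
    """
    buckets = defaultdict(list)
    for prompt, events in decomposed.items():
        key = min(len(events), max_bucket)
        buckets[key].append(prompt)
    return dict(buckets)

# Hypothetical, hand-written decompositions (not real benchmark entries).
samples = {
    "a person waves": ["waves"],
    "a person walks forward, then sits down": ["walks forward", "sits down"],
    "a person jumps, turns around, and bows": ["jumps", "turns around", "bows"],
}
strata = stratify_by_event_count(samples)
```

Reporting metrics per bucket, rather than over the pooled test set, is what would let single-event and multi-event performance be compared separately.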
Abstract
Text-to-motion generation has advanced with diffusion models, yet existing systems often collapse complex multi-action prompts into a single embedding, leading to omissions, reordering, or unnatural transitions. In this work, we shift perspective by introducing a principled definition of an event as the smallest semantically self-contained action or state change in a text prompt that can be temporally aligned with a motion segment. Building on this definition, we propose Event-T2M, a diffusion-based framework that decomposes prompts into events, encodes each with a motion-aware retrieval model, and integrates them through event-based cross-attention in Conformer blocks. Existing benchmarks mix simple and multi-event prompts, making it unclear whether models that succeed on single actions generalize to multi-action cases. To address this, we construct HumanML3D-E, the first benchmark…
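To make the idea of event-based cross-attention more concrete, here is a minimal single-head sketch in plain Python, in which each motion token acts as a query over a set of per-event embeddings. It is a simplification under stated assumptions, not the paper's implementation: there are no learned projections, no Conformer block, and no retrieval-model encoder, and the name `event_cross_attention` and the toy vectors are invented for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def event_cross_attention(motion_tokens, event_embeddings):
    """Scaled dot-product attention: motion tokens are queries,
    event embeddings serve as both keys and values (assumption:
    no learned Q/K/V projections, single head)."""
    d = len(event_embeddings[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for q in motion_tokens:
        # Similarity of this motion token to every event embedding.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale
                  for k in event_embeddings]
        w = softmax(scores)
        # Weighted sum of event embeddings.
        out = [sum(wj * v[i] for wj, v in zip(w, event_embeddings))
               for i in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy inputs: two events, three motion tokens (all values hypothetical).
events = [[1.0, 0.0], [0.0, 1.0]]
motion = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]
attended, attn = event_cross_attention(motion, events)
```

In this toy setup the first motion token attends mostly to the first event and the second token to the second event, which is the kind of per-segment alignment that a single pooled text embedding cannot express.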
Peer Reviews
Decision: ICLR 2026 Poster
- Modeling motion complexity by the number of events is straightforward and reasonable. The proposed HumanML3D-E benchmark will benefit the community, since it lets motion generation frameworks be evaluated at finer levels of complexity.
- The experimental analysis of different methods at different event counts supports the motivation for the proposed event-based benchmark.
- The design of the event-based cross-attention module is reasonable and validated by ablation st
- The events of a motion are divided by an LLM from text input alone, so the labels may contain errors. Manually validating the labels, or sampling cases to check the accuracy of the LLM labels, would be beneficial.
- The paper omits comparisons with some recent, stronger baselines, e.g., MoGenTS (NeurIPS 2024), MARDM (CVPR 2025), and LAMP (ICLR 2025).
- The event-based benchmark covers only one dataset, HumanML3D. It would be better to add more datasets, e.g., KIT-ML and Motion-X, to better validate
- The problem is highly significant. Generating complex and consistent human motions is an unsolved challenge in the T2M field.
- The paper proposes a novel benchmark, HumanML3D-E, the first benchmark stratified by the "event complexity" of the prompts. It provides a valuable evaluation tool for future research on long and complex T2M generation.
- The idea of decomposing complex motions is intuitive and logical.
- **Unfair Comparison**: The authors' new benchmark, HumanML3D-E, is constructed using an LLM and a specific "event-aware prompt." However, the proposed model, Event-T2M, **also relies on the exact same LLM and the exact same prompt** in its data preprocessing stage. Event-T2M is evaluated on a test set that is perfectly aligned with its own training and inference pipeline. In contrast, all baseline models are evaluated without using this LLM-based event decomposition preprocessing. This constit
1. Proposes an event-based paradigm for motion generation.
2. Constructs the first event-level motion generation dataset.
1. Does event-driven motion generation offer advantages over action-driven or hybrid (action + event) methods?
2. Does the proposed method outperform approaches that enhance motion quality through motion retrieval?
3. In TMR, innovation based solely on input differences does not constitute true novelty.
4. LIMM, ATII, and ECA follow common module design patterns and lack sufficient originality.
Taxonomy
Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
