MotionMERGE: A Multi-granular Framework for Human Motion Editing, Reasoning, Generation, and Explanation
Bizhu Wu, Jinheng Xie, Wenting Chen, Zhe Kong, Jianfeng Ren, Linlin Shen, Ruibin Bai, Rong Qu

TL;DR
MotionMERGE introduces a unified framework that enhances fine-grained, language-guided human motion understanding, editing, and reasoning by modeling at part and temporal levels, supported by a large annotated dataset.
Contribution
The paper pioneers fine-grained, language-guided motion control through explicit part-level and temporal-level modeling, and proposes a novel pre-training strategy for cross-granularity alignment and reasoning.
Findings
Demonstrates improved precision in motion generation and editing.
Achieves strong zero-shot performance on complex motion tasks.
Establishes a new benchmark with the MotionFineEdit dataset.
Abstract
Recent motion-language models unify tasks like comprehension and generation but operate at a coarse granularity, lacking the fine-grained understanding and nuanced control over body parts needed for animation or interaction. This stems from fundamental issues in both the model and the data: the model cannot focus on localized motion patterns, and the training data lacks fine-grained supervision. To tackle this, we propose MotionMERGE, a unified framework that bridges the granularity gap. First, we pioneer the study of fine-grained language-guided motion control, including detailed understanding and localized editing, by explicitly modeling motion at part and temporal levels within a single LLM, thereby endowing the model with robust priors for precise control. Second, we design Reasoning-Aware Granularity-Synergy pre-training, a novel strategy that employs joint supervision for…
