IKMo: Image-Keyframed Motion Generation with Trajectory-Pose Conditioned Motion Diffusion Model
Yang Zhao, Yan Zhang, Xubo Yang

TL;DR
IKMo introduces a decoupled, diffusion-based human motion generation method utilizing trajectory and pose inputs, enhanced by a two-stage conditioning framework and MLLM-based pre-processing, achieving superior fidelity and controllability.
Contribution
The paper presents IKMo, a novel motion generation approach that decouples trajectory and pose inputs with a two-stage conditioning framework and integrates MLLM-based agents for improved control and realism.
Findings
Outperforms state-of-the-art methods on all metrics under trajectory-keyframe constraints on the HumanML3D and KIT-ML datasets.
MLLM-based agents improve user satisfaction and motion alignment.
Enhanced motion fidelity and controllability demonstrated.
Abstract
Existing human motion generation methods that take trajectory and pose inputs process both modalities globally, leading to suboptimal outputs. In this paper, we propose IKMo, an image-keyframed motion generation method based on a diffusion model in which trajectory and pose are decoupled. The trajectory and pose inputs pass through a two-stage conditioning framework. In the first stage, a dedicated optimization module refines the inputs. In the second stage, trajectory and pose are encoded in parallel by a Trajectory Encoder and a Pose Encoder. A motion ControlNet then processes the fused trajectory and pose data to guide the generation of motion with high spatial and semantic fidelity. Experimental results on the HumanML3D and KIT-ML datasets demonstrate that the proposed method outperforms state-of-the-art methods on all metrics under trajectory-keyframe constraints. In addition,…
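The second conditioning stage described in the abstract (parallel encoders, fusion, a ControlNet-style branch) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class and function names, the feature dimensions, fusion by element-wise addition, and the keyframe mask are all assumptions made for the example. The zero-initialized output projection is borrowed from the original ControlNet design for images; the abstract does not say whether IKMo uses it.

```python
import random

def linear(in_dim, out_dim, zero=False, seed=0):
    """Return an (out_dim x in_dim) weight matrix; all zeros if zero=True."""
    rng = random.Random(seed)
    return [[0.0 if zero else rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]

def apply(weights, x):
    """Matrix-vector product for one frame's feature vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

class ControlBranch:
    """Hypothetical control branch: two parallel encoders, then fusion."""

    def __init__(self, traj_dim=3, pose_dim=6, hidden=8, zero_out=True):
        self.traj_enc = linear(traj_dim, hidden, seed=1)  # stand-in trajectory encoder
        self.pose_enc = linear(pose_dim, hidden, seed=2)  # stand-in pose encoder
        # Zero-initialized projection (assumed, following the original ControlNet):
        # before training, the branch contributes nothing to the base model.
        self.out_proj = linear(hidden, hidden, zero=zero_out, seed=3)

    def __call__(self, traj, pose, keyframe_mask):
        residuals = []
        for t, p, m in zip(traj, pose, keyframe_mask):
            h_traj = apply(self.traj_enc, t)
            # Pose features only count on frames flagged as keyframes.
            h_pose = [m * v for v in apply(self.pose_enc, p)]
            # Fusion by addition is an assumption of this sketch.
            fused = [a + b for a, b in zip(h_traj, h_pose)]
            residuals.append(apply(self.out_proj, fused))
        return residuals

def add_control(base_features, residuals):
    """Add per-frame control residuals to the base denoiser's features."""
    return [[b + r for b, r in zip(bf, rf)]
            for bf, rf in zip(base_features, residuals)]

# Toy inputs: 4 frames, root trajectory (x, y, z), a 6-D pose vector per frame,
# and keyframe poses given only on the first and last frames.
traj = [[0.1 * i, 0.0, 0.2 * i] for i in range(4)]
pose = [[0.5] * 6 for _ in range(4)]
mask = [1, 0, 0, 1]
base = [[1.0] * 8 for _ in range(4)]

untrained = add_control(base, ControlBranch()(traj, pose, mask))
active = add_control(base, ControlBranch(zero_out=False)(traj, pose, mask))
```

With the zero-initialized projection, `untrained` equals `base`, so the control branch starts as a no-op on the pretrained denoiser. With a non-zero projection, the trajectory and pose signals shift the per-frame features.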
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization
Methods: Diffusion
