OmniControl: Control Any Joint at Any Time for Human Motion Generation
Yiming Xie, Varun Jampani, Lei Zhong, Deqing Sun, Huaizu Jiang

TL;DR
OmniControl introduces a unified diffusion-based model that enables flexible, joint-specific spatial control in human motion generation, improving realism and adherence to constraints over previous pelvis-only control methods.
Contribution
The paper presents a novel analytic spatial guidance method combined with realism guidance, allowing control over multiple joints at different times within a single model.
Findings
Significant improvement in pelvis control accuracy.
Effective incorporation of multi-joint spatial constraints.
Enhanced motion realism and coherence.
Abstract
We present a novel approach named OmniControl for incorporating flexible spatial control signals into a text-conditioned human motion generation model based on the diffusion process. Unlike previous methods that can only control the pelvis trajectory, OmniControl can incorporate flexible spatial control signals over different joints at different times with only one model. Specifically, we propose analytic spatial guidance that ensures the generated motion can tightly conform to the input control signals. At the same time, realism guidance is introduced to refine all the joints to generate more coherent motion. Both the spatial and realism guidance are essential and they are highly complementary for balancing control accuracy and motion realism. By combining them, OmniControl generates motions that are realistic, coherent, and consistent with the spatial constraints. Experiments on…
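The abstract does not spell out the guidance formula, so the following is only a minimal sketch of how a spatial guidance step of this kind could work. It assumes the motion is already expressed as global joint positions, that the control signal is a dense array with a binary mask marking which joint is constrained at which frame, and that the loss is a mean squared distance. The function name `spatial_guidance_step` and the parameter `step_size` are hypothetical and not taken from the paper.

```python
import numpy as np


def spatial_guidance_step(motion, control, mask, step_size=0.5):
    """One gradient step pulling constrained joints toward their targets.

    motion  : (frames, joints, 3) predicted global joint positions (assumed).
    control : (frames, joints, 3) target positions; ignored where mask is 0.
    mask    : (frames, joints) float array, 1 where a joint is constrained.
    Returns the updated motion and the loss measured before the update.
    """
    n_active = max(float(mask.sum()), 1.0)
    # Error only on the constrained (frame, joint) pairs.
    diff = (motion - control) * mask[..., None]
    # Assumed loss: mean squared distance over the constrained pairs.
    loss = float((diff ** 2).sum() / n_active)
    # Gradient of that loss with respect to `motion`, written in closed form.
    grad = 2.0 * diff / n_active
    return motion - step_size * grad, loss
```

Because the mask can select any joint at any frame, the same routine covers pelvis-only control and sparse multi-joint control. A step like this only moves the constrained entries, which is one way to see why a second mechanism is needed to keep the remaining joints coherent.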
Peer Reviews
Decision: ICLR 2024 poster
* (1) The method is the first to control any joint at any time for human motion synthesis, which can improve the flexibility of motion generation tasks and potentially benefit downstream applications such as generating human motion on different terrains.
* (2) The method design is clear and reasonable.
* (3) Experiments demonstrate the effectiveness of the proposed method.
* (4) The analysis for method ablations is solid.
* (5) The paper is well-organized and easy to follow.
* (1) The inference speed of the proposed method is much lower than that of the baselines, which could impede applying the method to large amounts of data.
* (2) In the third column of Figure 1, the authors show that the method can support a combination of control signals from different joints. However, the paper lacks quantitative analysis to further examine its performance.
The paper offers a simple yet effective method to integrate spatial control signals into a text-conditioned human motion generation model based on the diffusion process. The introduction of realism guidance to refine all joints for generating more coherent motion is commendable. The evaluation is adequate and comprehensive.
It would be better if the difference between global and local coordinates, and the advantage of one over the other, were visualized. Inference time is higher than for MDM and GMD. The concepts in Fig. 4, such as the input process, require further clarification for better comprehension. In addition, the components of the spatial encoder F and the size of its output f_n are not explained.
- The method allows using spatial guidance to constrain the generated motion sequence. The constraint can be placed on any joint instead of just the pelvis.
- By adding the realism guidance, the conditionally generated motion sequences can be expected to be more natural under the spatial constraint. The realism guidance is a trainable copy of the encoder that enforces the spatial constraints. It is essentially an enforced encoder fusing the information from the language prompt and the spatial constraints.
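The "trainable copy of the encoder" idea mentioned above can be illustrated with a small sketch. This is not the authors' implementation: it uses a single linear layer in place of a real encoder, and it assumes the copy is attached through a zero-initialized projection so that the combined model starts out identical to the frozen one. The names `Linear` and `make_guided_layer` are hypothetical.

```python
import numpy as np


class Linear:
    """Tiny linear layer standing in for an encoder block (illustration only)."""

    def __init__(self, weight, bias):
        self.weight = weight
        self.bias = bias

    def __call__(self, x):
        return x @ self.weight + self.bias


def make_guided_layer(frozen, dim):
    """Pair a frozen layer with a trainable copy that also sees control features."""
    # The copy starts from the frozen weights; only the copy would be trained.
    copy = Linear(frozen.weight.copy(), frozen.bias.copy())
    # Assumption: a zero-initialized link, so the copy contributes nothing at first.
    zero_proj = Linear(np.zeros((dim, dim)), np.zeros(dim))

    def forward(x, control_feat):
        base = frozen(x)                               # original pathway
        residual = zero_proj(copy(x + control_feat))   # control-aware pathway
        return base + residual

    return forward, copy, zero_proj
```

With the projection at zero, the output equals the frozen layer's output regardless of the control features; as the projection and the copy are trained, the control-aware pathway can adjust all features, not only those of the constrained joints.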
- The writing is not fluent enough and needs polishing to be easier to follow.
- Given the carefully designed modules, the training-time efficiency is important for evaluating the significance of the proposed method. However, this part is missing from the paper.
- Some important baselines are missing from the experiments section, such as [1,2]. Adding the full set of published baselines on the HumanML3D and KIT-ML benchmarks could substantially change where the proposed method ranks. Can the
Taxonomy
Topics: Human Pose and Action Recognition · Human Motion and Animation · Stroke Rehabilitation and Recovery
Methods: Diffusion
