Diffusion Implicit Policy for Unpaired Scene-aware Motion Synthesis
Jingyu Gong, Chong Zhang, Fengqi Liu, Ke Fan, Qianyu Zhou, Xin Tan, Zhizhong Zhang, Yuan Xie

TL;DR
This paper introduces Diffusion Implicit Policy (DIP), a novel framework for scene-aware motion synthesis that does not require paired data, enabling more natural and interaction-plausible motions across diverse scenes.
Contribution
The paper proposes a unified, data-efficient framework that disentangles human-scene interaction from motion synthesis and employs implicit policy optimization during inference.
Findings
Outperforms existing methods in motion naturalness and interaction plausibility.
Effective in diverse scenes, including real-world environments.
Supports long-term motion synthesis with motion blending techniques.
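The abstract places this motion blending in "joint rotation power space" but does not spell out the formula. A minimal sketch of one plausible reading follows, assuming that blending means raising the relative rotation between two joint poses to a fractional power (which coincides with spherical linear interpolation). The quaternion layout `(w, x, y, z)`, the linear weight ramp, and the names `q_pow`, `blend`, and `blend_clips` are illustrative assumptions, not the authors' implementation.

```python
import math

def q_mul(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw*bw - ax*bx - ay*by - az*bz,
            aw*bx + ax*bw + ay*bz - az*by,
            aw*by - ax*bz + ay*bw + az*bx,
            aw*bz + ax*by - ay*bx + az*bw)

def q_conj(q):
    """Conjugate, which is the inverse for a unit quaternion."""
    w, x, y, z = q
    return (w, -x, -y, -z)

def q_pow(q, p):
    """Raise a unit quaternion to the power p (scales the rotation angle by p)."""
    w, x, y, z = q
    if w < 0:  # pick the representation with the shorter rotation arc
        w, x, y, z = -w, -x, -y, -z
    s = math.sqrt(x*x + y*y + z*z)
    if s < 1e-12:  # no rotation: any power is the identity
        return (1.0, 0.0, 0.0, 0.0)
    half = math.atan2(s, w)          # half of the rotation angle
    k = math.sin(p * half) / s
    return (math.cos(p * half), x*k, y*k, z*k)

def blend(qa, qb, w):
    """Assumed blend rule: qa * (qa^-1 * qb)^w, so w=0 gives qa and w=1 gives qb."""
    return q_mul(qa, q_pow(q_mul(q_conj(qa), qb), w))

def blend_clips(clip_a, clip_b, overlap):
    """Join two clips (lists of frames, each a list of joint quaternions).

    The last `overlap` frames of clip_a are blended with the first `overlap`
    frames of clip_b using a linear weight ramp (a hypothetical choice).
    """
    assert 1 <= overlap <= min(len(clip_a), len(clip_b))
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)
        frame_a = clip_a[len(clip_a) - overlap + i]
        frame_b = clip_b[i]
        mixed.append([blend(qa, qb, w) for qa, qb in zip(frame_a, frame_b)])
    return clip_a[:-overlap] + mixed + clip_b[overlap:]
```

Blending per joint in rotation space, rather than averaging joint positions, keeps every blended pose a valid rotation, which is one reason such a scheme suits stitching short clips into a long sequence.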
Abstract
Scene-aware motion synthesis has been widely researched recently due to its numerous applications. Prevailing methods rely heavily on paired motion-scene data, and they generalize poorly to diverse scenes when trained on only a few specific ones. Thus, we propose a unified framework, termed Diffusion Implicit Policy (DIP), for scene-aware motion synthesis, where paired motion-scene data are no longer necessary. In this paper, we disentangle human-scene interaction from motion synthesis during training, and then introduce an interaction-based implicit policy into motion diffusion during inference. Synthesized motion is derived through iterative diffusion denoising and implicit policy optimization, so motion naturalness and interaction plausibility are maintained simultaneously. For long-term motion synthesis, we introduce motion blending in joint rotation power space.…
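The inference procedure described in the abstract, alternating diffusion denoising with implicit policy optimization, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the "denoiser" is a hand-written stand-in for a learned motion prior, the single reward penalizes values below a floor (a crude proxy for scene penetration), and the names `denoise_step`, `reward_grad`, and `synthesize` are hypothetical. It is not the paper's model or its reward set.

```python
import random

def denoise_step(x, t, prior_mean=0.0):
    """Stand-in for a learned denoiser: pull the sample toward a motion prior."""
    alpha = 1.0 / (t + 2)  # the pull grows stronger as t approaches 0
    return [xi + alpha * (prior_mean - xi) for xi in x]

def penalty(x, floor=0.5):
    """Toy interaction cost: squared depth below the floor, summed over dims."""
    return sum(max(0.0, floor - xi) ** 2 for xi in x)

def reward_grad(x, floor=0.5):
    """Gradient of the reward r(x) = -penalty(x) with respect to x."""
    return [2.0 * max(0.0, floor - xi) for xi in x]

def synthesize(num_steps=50, dim=4, lr=0.2, policy_iters=3, seed=0):
    """Alternate one denoising step with a few reward-ascent steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)             # keeps the sample near the prior
        for _ in range(policy_iters):      # nudges it toward higher reward
            g = reward_grad(x)
            x = [xi + lr * gi for xi, gi in zip(x, g)]
    return x

guided = synthesize()
unguided = synthesize(policy_iters=0)
print("penalty with policy steps:   ", penalty(guided))
print("penalty without policy steps:", penalty(unguided))
```

Because the reward is applied only at inference time, the prior in this sketch never sees scene information during "training", which mirrors the abstract's claim that paired motion-scene data are not needed.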
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
There are several strengths of the proposed method. First, by leveraging the implicit diffusion policy, motion synthesis is converted into an optimization problem instead of a supervised learning problem that requires paired motion-scene data. This alleviates the difficulty caused by the lack of training data and can ensure the naturalness of the motion as long as the rewards are designed effectively. Second, the introduced centroid-oriented sampling distribution in the
There are also several weaknesses in the proposed method. First, it does not model the interaction between the hands and the objects, which restricts motion synthesis to scenarios that only require correct contact between the body shape and coarse scene surfaces, and this in turn limits the diversity of the synthesized motion. Second, the dependence on human motion diffusion models provides human motion priors for motion synthesis on the one hand, but on the other hand can be a source of bias, so it
Optimizing the main objective alongside 6 reward functions seems very tricky, and I am impressed that this worked.
This paper seems to have logical, grammatical, or formatting errors throughout, relating to how citations belong in sentences. For example, it often calls specific people "works", and names and semicolons are often inside sentences where they should not be part of the prose. E.g., "Some of them Wang et al. (2021; 2022a) utilized..." -- this doesn't make sense. Maybe the authors simply pasted their text into the ICLR template and then did not check for any errors. This makes the paper much more d
+ Compared with two baseline methods, the generated motion sequence is apparently more natural and physically plausible, with less scene penetration and less foot skating.
+ The idea of using a diffusion model to improve motion quality is reasonable and might be the trend.
- The effectiveness of the ControlNet structure of the motion diffusion model is questionable, as there is no ablation study. It should be possible to enforce position control during diffusion sampling. Why bother training a control module?
- The interaction policy consists of well-known constraint functions, making the contribution of this part weak. It is also debatable whether these constraint terms can be called rewards, a term taken from reinforcement learning.
- From the visual results
1. This paper tackles an inherent problem of scene-aware motion synthesis: limited motion-scene paired data.
2. The proposed DIP achieves promising results in penetration and time metrics.
1. Overall, the writing of the paper is mostly not self-contained. Readers with limited background in diffusion-based motion synthesis may have a difficult time comprehending the gist of the paper. Most importantly, it is difficult to understand why the method is free from motion-scene paired data.
2. There is no comparison of the quality of generated motions using well-known metrics such as FID or action recognition accuracy.
3. The proposed DIP falls short in the contact metric, as shown in Table 1
Taxonomy
Topics: Advanced Vision and Imaging · Computer Graphics and Visualization Techniques · Video Coding and Compression Technologies
Methods: Diffusion
