Unleashing Guidance Without Classifiers for Human-Object Interaction Animation
Ziyin Wang, Sirui Xu, Chuan Guo, Bing Zhou, Jiangshan Gong, Jian Wang, Yu-Xiong Wang, Liang-Yan Gui

TL;DR
This paper introduces LIGHT, a novel diffusion-based method for human-object interaction animation that eliminates the need for handcrafted contact priors by using denoising pace as guidance, leading to more realistic and contact-aware animations.
Contribution
LIGHT presents a data-driven guidance approach in diffusion models for HOI animation, reducing reliance on manual priors and improving contact fidelity and generalization.
Findings
Outperforms classifier-free guidance in contact fidelity.
Achieves more realistic and diverse HOI animations.
Generalizes well to unseen objects and tasks.
Abstract
Generating realistic human-object interaction (HOI) animations remains challenging because it requires jointly modeling dynamic human actions and diverse object geometries. Prior diffusion-based approaches often rely on hand-crafted contact priors or human-imposed kinematic constraints to improve contact quality. We propose LIGHT, a data-driven alternative in which guidance emerges from the denoising pace itself, reducing dependence on manually designed priors. Building on diffusion forcing, we factor the representation into modality-specific components and assign individualized noise levels with asynchronous denoising schedules. In this paradigm, cleaner components guide noisier ones through cross-attention, yielding guidance without auxiliary classifiers. We find that this data-driven guidance is inherently contact-aware, and can be enhanced when training is augmented with a broad…
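As a rough illustration of the asynchronous denoising schedules described above, the sketch below pairs a "leading" component with a "lagging" one that stays noisier by a fixed offset. It is not the authors' implementation: the function name `async_schedule`, the uniform stride, and the constant-offset rule are assumptions made only to show how one component can be cleaner than another at every iteration.

```python
from typing import List, Tuple


def async_schedule(total_steps: int, stride: int, lag: int) -> List[Tuple[int, int]]:
    """Return (t_lead, t_lag) noise levels for each denoising iteration.

    Hypothetical rule (an assumption, not taken from the paper): the lagging
    component trails the leading one by `lag` noise levels, clipped to the
    range [0, total_steps].
    """
    if total_steps < 0 or stride <= 0 or lag < 0:
        raise ValueError("total_steps and lag must be >= 0 and stride must be > 0")

    pairs: List[Tuple[int, int]] = []
    t = total_steps
    while True:
        t_lead = max(t, 0)                          # cleaner component
        t_lag = min(total_steps, max(t + lag, 0))   # noisier component
        pairs.append((t_lead, t_lag))
        if t_lag == 0:                              # both components are clean
            break
        t -= stride
    return pairs


if __name__ == "__main__":
    for t_lead, t_lag in async_schedule(total_steps=1000, stride=100, lag=300):
        print(f"lead t={t_lead:4d}   lag t={t_lag:4d}")
```

At every iteration the leading component is at a noise level no higher than the lagging one, which is the condition under which the cleaner component could serve as a conditioning signal through cross-attention. Which modality leads, and how large the lag should be, are design choices this sketch does not settle.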
Peer Reviews
Decision: ICLR 2026 Poster
+ The proposed pace-induced guidance and contact-aware shape-spectrum augmentation are novel. Specifically, (1) the pace-induced guidance is shown to provide a data-driven alternative that is even more effective than the priors used in previous methods. (2) The augmentation builds an informative invariance directly into the training data, leading to improved generalization.
+ The method demonstrates clear quantitative and qualitative improvements over strong baselines (HOI-Diff, CHOIS, InterD
- The method is currently designed and evaluated for interaction with a single object only. Evaluating interactions with multiple and more complex objects would be beneficial.
- Comparisons with zero-shot HOI generation methods, such as InterDreamer and ZeroHSI, may also be useful.
- Minor issue: the inference process requires around 72 seconds, which is longer than HOI-Diff and InterDiff (non-guided baselines).
- The idea of pace-induced guidance (asynchronous denoising between modalities) is innovative, extending diffusion forcing into a practical HOI setting.
- The quantitative results are comprehensive, with thorough comparisons against existing baselines and well-conducted ablation studies.
- The final visual results show that the proposed strategy can effectively leverage human priors to generate plausible interaction poses. However, the method still struggles with fine-grained contact modeling, and noticeable artifacts remain at the contact level.
- The evaluation metrics seem somewhat questionable: the R-Precision scores are all within a similar range, and several other metrics also show only minor differences across methods. It is unclear whether these m
1. It proposes a pace-induced guidance mechanism to generate more realistic and plausible human-object interactions.
2. Through extensive experiments, it analyzes the effects of pace-induced guidance, token separation, augmented data, guidance intensity and denoising lag, and guidance direction.
3. The authors also provide fair comparisons by re-implementing and modifying prior baselines (e.g., InterDiff, Text2HOI).
1. The section "Impact of Guidance Intensity and Denoising Lagging" does not reference Figure 5. Please explicitly link the analysis to the figure.
2. In the "Impact of Guidance Intensity and Denoising Lagging" experiment, the paper states that the best value of δ is 300, but Figure 5 seems to suggest that 200 performs best. Could the authors clarify which value is correct?
3. I am not fully clear about the settings in Table 2. For the case where hand-body separation is ✓ and human-object
Taxonomy
Topics: Human Motion and Animation · Robot Manipulation and Learning · Social Robot Interaction and HRI
