HOIDiNi: Human-Object Interaction through Diffusion Noise Optimization
Roey Ron, Guy Tevet, Haim Sawdayee, Amit H. Bermano

TL;DR
HOIDiNi is a diffusion-based framework that synthesizes realistic human-object interactions from text prompts, balancing contact accuracy and motion naturalness through a novel noise optimization approach.
Contribution
It applies Diffusion Noise Optimization (DNO) to HOI synthesis, separating object contact planning from human motion refinement for improved realism and control.
Findings
Outperforms prior methods in contact accuracy and realism
Generates complex interactions like grasping and placement from text
Achieves high-quality, controllable HOI synthesis
Abstract
We present HOIDiNi, a text-driven diffusion framework for synthesizing realistic and plausible human-object interaction (HOI). HOI generation is extremely challenging since it induces strict contact accuracies alongside a diverse motion manifold. While current literature trades off between realism and physical correctness, HOIDiNi optimizes directly in the noise space of a pretrained diffusion model using Diffusion Noise Optimization (DNO), achieving both. This is made feasible thanks to our observation that the problem can be separated into two phases: an object-centric phase, primarily making discrete choices of hand-object contact locations, and a human-centric phase that refines the full-body motion to realize this blueprint. This structured approach allows for precise hand-object contact without compromising motion naturalness. Quantitative, qualitative, and subjective evaluations…
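The noise-space optimization and the two-phase split described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation, and every name in it is hypothetical. It assumes a 1-D trajectory in place of full-body motion, a fixed smoothing chain (`sample`) in place of a pretrained diffusion sampler, and finite-difference gradients in place of backpropagation through the sampler. It also omits any regularization of the noise. The only idea it keeps is that the sampler stays fixed while its initial noise is optimized, first for the object and then for the hand conditioned on the object result.

```python
import random

def sample(noise, steps=10, rate=0.5):
    """Toy deterministic sampler (assumption): repeatedly smooths a 1-D trajectory.
    It stands in for the fixed denoising chain of a pretrained model."""
    x = list(noise)
    n = len(x)
    for _ in range(steps):
        smooth = [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3.0
                  for i in range(n)]
        x = [xi + rate * (si - xi) for xi, si in zip(x, smooth)]
    return x

def constraint_loss(traj, targets):
    """Sum of squared errors at constrained frames; targets maps frame -> value."""
    return sum((traj[f] - v) ** 2 for f, v in targets.items())

def numeric_grad(noise, targets, eps=1e-4):
    """Central-difference gradient of the loss with respect to the initial noise.
    A real system would backpropagate through the sampler instead."""
    grad = []
    for i in range(len(noise)):
        up = list(noise)
        down = list(noise)
        up[i] += eps
        down[i] -= eps
        diff = (constraint_loss(sample(up), targets)
                - constraint_loss(sample(down), targets))
        grad.append(diff / (2.0 * eps))
    return grad

def optimize_noise(noise, targets, iters=120, lr=1.0):
    """Gradient descent on the noise; the sampler itself is never modified."""
    noise = list(noise)
    for _ in range(iters):
        grad = numeric_grad(noise, targets)
        noise = [z - lr * g for z, g in zip(noise, grad)]
    return noise

def two_phase_demo(num_frames=20, seed=0):
    rng = random.Random(seed)

    # Phase 1 (object-centric): the object should start near 0.0 and end near 1.0.
    object_noise = [rng.gauss(0.0, 1.0) for _ in range(num_frames)]
    object_noise = optimize_noise(object_noise, {2: 0.0, 17: 1.0})
    object_traj = sample(object_noise)

    # The phase 1 result is frozen; contacts are read off it at chosen frames.
    contacts = {f: object_traj[f] for f in (5, 10, 15)}

    # Phase 2 (human-centric): the hand must meet the object at those frames.
    hand_noise = [rng.gauss(0.0, 1.0) for _ in range(num_frames)]
    hand_noise = optimize_noise(hand_noise, contacts)
    hand_traj = sample(hand_noise)

    return object_traj, contacts, hand_traj, constraint_loss(hand_traj, contacts)

if __name__ == "__main__":
    _, contacts, hand_traj, loss = two_phase_demo()
    for frame, target in contacts.items():
        print(frame, round(target, 4), round(hand_traj[frame], 4))
    print("final contact loss:", loss)
```

Because the hand trajectory is always an output of the fixed sampler, it keeps that sampler's smoothness while still meeting the contact targets, which is the intuition behind optimizing noise rather than editing the motion directly.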
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
1. The video demonstrates high-quality hand-object manipulation motions and shows a clear improvement over IMoS and CHOIS for hand-object grasping. 2. HOIDiNi is guided by textual prompts, making it versatile for various applications where user input is essential. 3. A user study indicates a strong preference for the motions generated by HOIDiNi over competing methods.
1. As the paper cites BimArt, a paper with the same idea should also be cited: “ManiDext: Hand-Object Manipulation Synthesis via Continuous Correspondence Embeddings and Residual-Guided Diffusion, TPAMI 2025”. 2. As the main results are hand-object grasping, it would be better to compare with some hand-only object manipulation papers for the quality of hand-object contact. 3. The paper mainly compares against IMoS and CHOIS. However, although the paper is titled “human-object interaction”, the video
1. The two-phase approach is well-motivated, as it first generates object-centric motion and then fixes and conditions it to produce human-centric motion. 2. It predicts human–object contact for each frame, unlike existing methods that rely on heuristics. 3. The results appear realistic in the visual comparisons, clearly outperforming other baselines. 4. The graph in Figure 7 is impressive and supports the validity of the two-phase approach. 5. The condition-matching results in Table 2 achie
1. Figure 1 looks visually appealing but lacks essential information, such as the weaknesses of previous work, specifically the trade-off between realistic motion and accurate contact modeling. 2. In Table 1, the penetration metric performs worse than IMoS, likely because IMoS generates motions with fewer contact instances, which also reduces motion diversity. What do you think of it? 3. There is semantic misalignment between the object and the mouth in the generated motion for the prompt “dr
Good motivation: The trade-off between realism and accuracy is a key issue in previous methods and needs to be tackled to achieve successful HOI motion generation. Some ideas are novel: Dividing the optimization into human-centric and object-centric phases seems like an interesting idea.
- Some important references are missing, and the method is compared with only two baselines (i.e., IMoS and CHOIS). Authors need to include them in the text. Also, I think authors need to empirically compare their method against ChainHOI. ChainHOI: Joint-based Kinematic Chain Modeling for Human-Object Interaction Generation, CVPR’25. HIMO: A new benchmark for full-body humans interacting with multiple objects, ECCV’24. - Furthermore, for the hand-only scenario, the following references are missing: Text2H
- Separating the optimization into two phases (object -> human) significantly improves stability and realism, effectively addressing a key limitation of prior single-step methods. - The model explicitly predicts semantic contact pairs rather than relying on heuristic nearest-neighbor matching, resulting in more consistent and physically meaningful grasps. - The quantitative and qualitative results for the second phase (whole-body generation) are both strong; in particular, from videos and images
- The paper primarily employs two datasets, but to my knowledge, OMOMO doesn't include hand motions. Therefore, what is the purpose of using Contact-Pairs here? Additionally, I did not observe any visualization results on OMOMO. - There are no quantitative or qualitative evaluations provided for the first (object-centric) phase to assess the quality of the generated object trajectories, accuracy of contact prediction and penetration score. - The motion diversity and scale of the GRAB dataset are
1. The proposed two-phase HOI generation framework is appealing and effective. It disentangles the HOI generation task, which involves synthesis of both human and object motion as well as ensuring their coherent interactions, into simpler tasks. 2. The visual results of generated HOIs are of high quality. Although there are still artifacts, the quality is generally higher than other baseline models. 3. The code and model checkpoints are promised to be released.
1. Using DNO to improve the contact between hands and objects is a critical component and contribution in this paper. However, except for the brief introduction of how DNO works in general in Section 3, it is never explained in detail how it is integrated into HOI. For instance, for the equation defined around line #187, what does $x_T$ correspond to in HOI generation? What is the definition of $R_{decor}$? Furthermore, DNO itself is not entirely novel. It is expected to see insights why using i
Taxonomy
Topics: Human-Automation Interaction and Safety · Evacuation and Crowd Dynamics
