HOI-Diff: Text-Driven Synthesis of 3D Human-Object Interactions using Diffusion Models
Xiaogang Peng, Yiming Xie, Zizhao Wu, Varun Jampani, Deqing Sun, Huaizu Jiang

TL;DR
This paper introduces HOI-Diff, a modular diffusion model framework that synthesizes realistic 3D human-object interactions from text prompts, incorporating motion generation, contact prediction, and contact-aware guidance.
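The contact-aware guidance mentioned above can be pictured, very roughly, as a gradient step that pulls a predicted human contact point toward a predicted object contact point. The sketch below is an illustration under stated assumptions, not the paper's guidance term: the squared-distance loss, the function names, the `scale` parameter, and the example coordinates are all hypothetical.

```python
import numpy as np

def contact_loss(human_point, object_point):
    """Squared distance between a human contact point and an object contact point."""
    diff = human_point - object_point
    return float(np.sum(diff ** 2))

def contact_grad(human_point, object_point):
    """Analytic gradient of contact_loss with respect to human_point."""
    return 2.0 * (human_point - object_point)

def guide_step(human_point, object_point, scale=0.1):
    """One guidance-style update: move the human point down the loss gradient.

    `scale` is a hypothetical guidance strength, not a value from the paper.
    """
    return human_point - scale * contact_grad(human_point, object_point)

# Hypothetical 3D positions (metres) for a hand joint and an object contact point.
hand = np.array([0.5, 1.0, 0.2])
handle = np.array([0.4, 0.9, 0.3])

before = contact_loss(hand, handle)
guided = guide_step(hand, handle, scale=0.1)
after = contact_loss(guided, handle)
# With this quadratic loss, one step shrinks the loss by a factor of (1 - 2 * scale) ** 2.
```

In a diffusion sampler, an update of this kind would typically be applied to the intermediate estimate at each denoising step, so contact is encouraged gradually rather than enforced in a single jump.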
Contribution
The paper presents a novel modular diffusion-based approach with a dual-branch motion model and an independent affordance prediction model for text-driven 3D HOI synthesis.
Findings
Produces realistic 3D HOIs with diverse interactions
Achieves accurate contact between humans and objects
Demonstrates effectiveness on BEHAVE and OMOMO datasets
Abstract
We address the problem of generating realistic 3D human-object interactions (HOIs) driven by textual prompts. To this end, we take a modular design and decompose the complex task into simpler sub-tasks. We first develop a dual-branch diffusion model (HOI-DM) to generate both human and object motions conditioned on the input text, and encourage coherent motions by a cross-attention communication module between the human and object motion generation branches. We also develop an affordance prediction diffusion model (APDM) to predict the contacting area between the human and object during the interactions driven by the textual prompt. The APDM is independent of the results by the HOI-DM and thus can correct potential errors by the latter. Moreover, it stochastically generates the contacting points to diversify the generated motions. Finally, we incorporate the estimated contacting points…
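As a rough illustration of the cross-attention communication between the human and object branches described in the abstract, the sketch below lets each branch attend to the other's per-frame features and adds the result back as a residual. It is a minimal single-head sketch under assumptions: the feature sizes, the projection matrices, the residual form, and the names `cross_attention` and `communicate` are hypothetical and are not taken from the HOI-DM implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: `queries` attend to `context`."""
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

def communicate(human_feats, object_feats, params):
    """Exchange information between the two branches with residual updates."""
    h_update, _ = cross_attention(human_feats, object_feats, *params["h_from_o"])
    o_update, _ = cross_attention(object_feats, human_feats, *params["o_from_h"])
    return human_feats + h_update, object_feats + o_update

# Hypothetical sizes: 8 frames, 16-dimensional features per frame.
rng = np.random.default_rng(0)
frames, dim = 8, 16
human = rng.normal(size=(frames, dim))
obj = rng.normal(size=(frames, dim))
params = {
    "h_from_o": tuple(rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3)),
    "o_from_h": tuple(rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3)),
}

new_human, new_obj = communicate(human, obj, params)
_, attn = cross_attention(human, obj, *params["h_from_o"])
```

Because each branch's update depends on the other branch's features, the human and object motions are computed jointly rather than independently, which is the intuition behind using such a module to keep them coherent.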
Peer Reviews
Decision: Submitted to ICLR 2025
1. The topic of text-driven HOI synthesis is of large importance in the community; 2. The annotation on BEHAVE is beneficial for the community; 3. Experimental results show the effectiveness of the proposed method; 4. The paper is well written and organized.
1. The motivation for decomposing HOI synthesis into simpler sub-tasks is not clearly elaborated and should be discussed. Why does the model first generate a roughly coherent HOI motion rather than directly generating a high-quality HOI motion? 2. The authors claim that text-driven 3D HOI synthesis with a diverse set of interactions is under-explored and that existing datasets lack either HOIs or textual descriptions. However, many works have been proposed for this task, such as CG-HOI and CHOIS. Th
Here I highlight the strengths of the paper: 1. The modular design that starts with a coarse grain and refines to a fine grain is valuable for human-object interaction generation. 2. The ablation study is sufficient to prove the effectiveness of the proposed method. 3. The presentation of this paper is clear, especially the pipeline of the proposed method.
Although the coarse-to-fine modular design is valuable, this paper only considers the contact between body parts and objects. This method neglects the important contact between hand and object in human-object interaction. Here, I highlight my concerns. 1. Lack of citations and discussion of related work. Most of the methods cited and compared are those before 2023. There are a lot of related works that appear in 2024, such as [1,2,...]. These new works should be discussed and compared. [1]CG-H
The use of a dual-branch diffusion model combined with an affordance prediction model allows for accurate and diverse generation of human-object interactions, addressing the complexity of such tasks effectively. The model demonstrates superior performance on standard metrics against baseline models, particularly in generating interactions with unseen objects, which showcases its robustness and ability to generalize across different scenarios.
The model's performance relies heavily on the quality of datasets like BEHAVE and OMOMO. If these datasets are limited or biased towards specific types of interactions, the model's generalization ability may be compromised. I request that the authors test the model with completely unseen objects not included in the BEHAVE or OMOMO datasets, for example by using a text-to-3D model [1] to create a new object and showing inference results for that object. [1] DreamFusion: Text-to-3D using 2D Diffus
Taxonomy
Topics: Human Motion and Animation · Human Pose and Action Recognition · Image Processing and 3D Reconstruction
Methods: Diffusion · classifier-guidance
