TL;DR
HOI-PAGE introduces a part-level affordance reasoning approach using LLMs to generate realistic zero-shot 4D human-object interactions from text prompts, emphasizing part mechanics over whole-body motion.
Contribution
It presents a structured part affordance graph guiding a three-stage process for high-fidelity, zero-shot 4D HOI generation focusing on part-level mechanics and constraints.
Findings
Generated complex multi-object and multi-person interactions.
Achieved significantly improved realism and text alignment.
Demonstrated flexibility in zero-shot 4D HOI synthesis.
Abstract
We present HOI-PAGE, a new approach that prioritizes part-level affordance reasoning to generate high-fidelity 4D human-object interactions (HOIs) from text prompts in a zero-shot fashion. In contrast to prior works that focus on global, whole body-object motion synthesis, our approach explicitly reasons about the underlying part-level mechanics of interactions using large language models (LLMs). We capture this reasoning in a structured part affordance graph (PAG) representation, serving as a high-level interaction scaffolding to guide a three-stage synthesis: first, decomposing input 3D objects into semantic parts; then, generating reference HOI videos from text prompts to extract part-based motion constraints; and finally, optimizing for 4D HOI motion sequences that mimic the reference dynamics while satisfying part-level contact constraints. Extensive experiments show that our…
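To make the part affordance graph idea concrete, here is a minimal illustrative sketch of how such a structure could be represented and flattened into text for a downstream generator. The abstract does not specify the actual PAG schema, so every name here (`PartNode`, `ContactEdge`, `PartAffordanceGraph`, `to_text`, the relation strings, and the chair example) is a hypothetical assumption, not the authors' implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class PartNode:
    """A semantic part of a human or an object (hypothetical schema)."""
    owner: str  # e.g. "person" or "chair"
    part: str   # e.g. "hips" or "seat"


@dataclass(frozen=True)
class ContactEdge:
    """A part-level contact between a body part and an object part."""
    body_part: PartNode
    object_part: PartNode
    relation: str  # free-text relation, e.g. "rest on" (assumed vocabulary)


@dataclass
class PartAffordanceGraph:
    """Assumed container: a text prompt plus its part-level contact edges."""
    prompt: str
    edges: list[ContactEdge] = field(default_factory=list)

    def add_contact(self, body_part: PartNode, object_part: PartNode,
                    relation: str) -> None:
        self.edges.append(ContactEdge(body_part, object_part, relation))

    def object_parts(self) -> set[PartNode]:
        # Object parts that a part-decomposition stage would need to segment.
        return {e.object_part for e in self.edges}

    def contact_pairs(self) -> list[tuple[str, str]]:
        # (body part, object part) pairs usable as contact constraints.
        return [(e.body_part.part, e.object_part.part) for e in self.edges]

    def to_text(self) -> str:
        # Flatten the graph into a prompt-like description.
        clauses = [
            f"the {e.body_part.owner}'s {e.body_part.part} {e.relation} "
            f"the {e.object_part.owner}'s {e.object_part.part}"
            for e in self.edges
        ]
        return f"{self.prompt}. Contacts: " + "; ".join(clauses) + "."


def build_example_pag() -> PartAffordanceGraph:
    """Hand-written example graph; in the paper an LLM produces the reasoning."""
    pag = PartAffordanceGraph(prompt="A person sits on a chair")
    pag.add_contact(PartNode("person", "hips"), PartNode("chair", "seat"),
                    "rest on")
    pag.add_contact(PartNode("person", "back"), PartNode("chair", "backrest"),
                    "leans against")
    return pag


if __name__ == "__main__":
    print(build_example_pag().to_text())
```

In this sketch, `object_parts()` stands in for the information the first stage (part decomposition) would consume, while `contact_pairs()` stands in for the part-level contact constraints used in the final optimization stage.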
Peer Reviews
Decision: Submitted to ICLR 2026
1. Learning object affordance is a challenging task. Using LLMs to construct part affordance graphs (PAGs) is an appealing way to leverage the semantic prior of a foundation model. The structured representation in a PAG is interpretable, compositional, and may be helpful for other applications (e.g., in robotics) as well. 2. The proposed zero-shot HOI generation framework is appealing as the motion capture data of this task is inherently limited. Using the priors of a video diffus
1. It is not clear how the generated 4D HOI videos follow the prompts for the objects and the affordances. - The structured representations of PAGs are converted into textual prompts, which are fed into an image and video diffusion model. However, it is possible that the generated motion may not follow the desired interactions between humans and objects. - The paper lacks details on how to condition the generated videos on the given 3D instances. 2. The generated PAGs may lack diversity to captur
1. The paper presents a novel perspective on zero-shot HOI generation. By leveraging Large Language Models (LLMs) to distill structured affordance knowledge, it effectively bypasses the need for limited 4D training data and achieves strong generalization. 2. The proposed Part Affordance Guidance (PAG) enables finer-grained control over the synthesis process. This explicit, part-level reasoning ensures more realistic interactions and accurate contact dynamics between specific human body parts an
1. The pipeline is heavily dependent on a complex cascade of pre-trained models (e.g., LLMs, T2V, segmentation). The framework's quality is bottlenecked by these components, and a failure in an intermediate step, such as poor video generation or segmentation, can cause the entire result to fail. 2. The physical realism of the final motion is not guaranteed. The system is optimized to fit a *generated reference video*, which itself may lack physical plausibility. Furthermore, the decoupled optimi
+ The three-stage approach is logically sound. The results produced by HOI-PAGE indeed present improvements over existing methods, in terms of penetration. + The paper is well-written and easy to follow.
- Limited novelty. (1) The part-level affordance map is not novel. Existing methods have explored the use of contact graphs, affordance maps, video generation, etc. (2) The use of LLMs is also not novel, which may limit the performance of the 4D generation as well. - Limited granularity of motions. The model cannot address the penetration issue. Additionally, the generated motion may still exhibit unwanted instability, especially for the objects. - More evaluations may be needed. (1) Com
Taxonomy
Topics: Human Motion and Animation · 3D Shape Modeling and Analysis · Robot Manipulation and Learning
Methods: Focus
