AvatarGO: Zero-shot 4D Human-Object Interaction Generation and Animation
Yukang Cao, Liang Pan, Kai Han, Kwan-Yee K. Wong, Ziwei Liu

TL;DR
AvatarGO is a novel zero-shot framework that generates and animates 4D human-object interaction scenes directly from text, overcoming data scarcity and improving realism and robustness in animation.
Contribution
It introduces LLM-guided contact retargeting and correspondence-aware motion optimization for zero-shot 4D HOI scene generation from text prompts.
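As a toy illustration of what keeping a human-object correspondence fixed during animation can mean, the sketch below rigidly carries object points along with a single moving body part. This is not the paper's optimization. The helper names (`rotation_z`, `attach_object`), the three-frame motion, and the rigid single-part attachment are all assumptions made for illustration only.

```python
import numpy as np


def rotation_z(theta):
    """Rotation matrix for an angle theta (radians) about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def attach_object(part_rotations, part_translations, object_local):
    """Carry object points along with one body part (hypothetical helper).

    part_rotations:    (T, 3, 3) per-frame rotation of the contact part
    part_translations: (T, 3)    per-frame position of the contact part
    object_local:      (N, 3)    object points expressed in the part's frame
    Returns (T, N, 3) world-space object points.
    """
    rotated = np.einsum("tij,nj->tni", part_rotations, object_local)
    return rotated + part_translations[:, None, :]


# Hypothetical three-frame motion of a contact part (e.g. a hand).
angles = np.array([0.0, np.pi / 2, np.pi])
rotations = np.stack([rotation_z(a) for a in angles])
translations = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])

# Two object points, defined once relative to the contact part.
object_local = np.array([[0.1, 0.0, 0.0], [0.2, 0.0, 0.0]])

world = attach_object(rotations, translations, object_local)

# The offset of each point from the part stays constant in every frame,
# which is the correspondence this toy example preserves.
offsets = np.linalg.norm(world - translations[:, None, :], axis=-1)
```

Because the object is defined in the part's local frame, it follows the part without drifting away from the contact location. A real pipeline would additionally need articulated body motion and explicit penetration handling, neither of which this sketch attempts.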
Findings
Outperforms existing methods in generating coherent human-object interactions, with reviewers noting higher CLIP scores and user preference.
Handles penetration issues more effectively.
Demonstrates robustness across diverse poses and object pairs.
Abstract
Recent advancements in diffusion models have led to significant improvements in the generation and animation of 4D full-body human-object interactions (HOI). Nevertheless, existing methods primarily focus on SMPL-based motion generation, which is limited by the scarcity of realistic large-scale interaction data. This constraint affects their ability to create everyday HOI scenes. This paper addresses this challenge using a zero-shot approach with a pre-trained diffusion model. Despite this potential, achieving our goals is difficult due to the diffusion model's lack of understanding of "where" and "how" objects interact with the human body. To tackle these issues, we introduce AvatarGO, a novel framework designed to generate animatable 4D HOI scenes directly from textual inputs. Specifically, 1) for the "where" challenge, we propose LLM-guided contact retargeting, which employs…
Peer Reviews
Decision: ICLR 2025 Poster
- The paper tackles the new task of zero-shot generation of 4D human-object interaction (HOI).
- The proposed LLM-guided contact retargeting helps initialize the object in reasonable 3D locations to help generate plausible HOI.
- The correspondence-aware motion optimization helps prevent penetration and maintain plausible interactions throughout the animation.
- Visually, the method outperforms previous methods in the quality of generated HOI. The higher CLIP score and user preference also suppo

- The overall appearance quality of generated humans and objects is still limited. The results are often blurry and lack realism. There is also still human-object penetration (see “Naruto in Naruto Series stepping on a football under his foot” on the webpage).
- The notations in object animation and correspondence-aware motion optimization are poorly explained. What are G_c and G_o? How are they computed? The paper mentioned they are derived from x_ci and x_oi, but it’s unclear.
- It seems the me

1. The qualitative results of the proposed method seem more plausible than the compared methods in Figure 3. Moreover, the authors conduct user studies to compare the proposed method with comparable methods in diverse criteria, e.g., penetration, motion quality, and overall performance.
2. The paper is well-structured, with clear sections from introduction to methodology and experiments. The authors provide detailed explanations about their task with various citations, which helps the reviewer

1. The reviewer wonders if the human motion is given as input or is generated by the proposed method. In Figure 2 caption (L185), the authors explain that correspondence-aware motion optimization jointly optimizes human and object animation, but in L345, it is written that the motion sequence is given. The reviewer requests clear explanations of what the inputs and outputs of the proposed pipeline are.
2. The reviewer thinks that the proposed method's novelty is limited. The objective function

1. The paper is well-written, with a clear motivation and an easy-to-follow presentation of the core idea. Related work is thoroughly covered, incorporating up-to-date research in the field.
2. Extensive experiments and ablation studies convincingly demonstrate the effectiveness of AvatarGO, showcasing appropriate human-object interactions and superior robustness against penetration issues.

I believe this paper is ready for publication, as I found no major weaknesses. My only question concerns the diversity of objects included in the text prompts used in the experiments. Additionally, how does your method perform across different types of objects?
Taxonomy
Topics: Human Motion and Animation · Human Pose and Action Recognition · Virtual Reality Applications and Impacts
Methods: Focus · Diffusion
