Human-Object Interaction from Human-Level Instructions

Zhen Wu; Jiaman Li; Pei Xu; C. Karen Liu

arXiv:2406.17840 · cs.AI · August 22, 2025 · 1 citation

Open Access

TL;DR

This paper presents a system that turns human-level instructions into realistic, physically plausible human-object interactions, including detailed finger movements, in complex environments. It uses large language models to plan the interaction and reinforcement learning to track the generated motion in physics simulation.

Contribution

It introduces the first system capable of synthesizing detailed, long-horizon human-object interactions driven by instructions, integrating language understanding with physics-based motion generation.

Findings

01. Successfully generates realistic human-object interactions in complex environments

02. Capable of detailed finger-object interaction synthesis

03. Ensures physical plausibility through reinforcement learning

Abstract

Intelligent agents must autonomously interact with the environments to perform daily tasks based on human-level instructions. They need a foundational understanding of the world to accurately interpret these instructions, along with precise low-level movement and interaction skills to execute the derived actions. In this work, we propose the first complete system for synthesizing physically plausible, long-horizon human-object interactions for object manipulation in contextual environments, driven by human-level instructions. We leverage large language models (LLMs) to interpret the input instructions into detailed execution plans. Unlike prior work, our system is capable of generating detailed finger-object interactions, in seamless coordination with full-body movements. We also train a policy to track generated motions in physics simulation via reinforcement learning (RL) to ensure…
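The abstract describes three stages: an LLM turns an instruction into an execution plan, a generator produces motion for each plan step, and an RL-trained policy tracks that motion in physics simulation. The Python sketch below only illustrates how such stages could be chained. It is not the authors' implementation: `PlanStep`, `toy_planner`, `run_pipeline`, and the action names are hypothetical, and a regular-expression rule stands in for the LLM so that the example runs offline.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class PlanStep:
    """One step of a hypothetical execution plan (field names are illustrative)."""
    action: str  # e.g. "walk", "pick_up", "put_down"
    target: str  # object or location the action refers to


def toy_planner(instruction: str) -> List[PlanStep]:
    """Stand-in for the LLM planner: handles only 'move the X to the Y'."""
    match = re.search(r"move the (\w+) to the (\w+)", instruction, re.IGNORECASE)
    if match is None:
        raise ValueError(f"unsupported instruction: {instruction!r}")
    obj, dest = match.group(1), match.group(2)
    return [
        PlanStep("walk", obj),
        PlanStep("pick_up", obj),
        PlanStep("walk", dest),
        PlanStep("put_down", obj),
    ]


def run_pipeline(
    instruction: str,
    planner: Callable[[str], List[PlanStep]],
    generate_motion: Callable[[PlanStep], str],
    track: Callable[[str], str],
) -> List[str]:
    """Chain the three stages: plan, generate reference motion, track it."""
    results = []
    for step in planner(instruction):
        reference = generate_motion(step)  # kinematic reference (placeholder)
        results.append(track(reference))   # physics-tracked result (placeholder)
    return results


# Usage with placeholder stages; strings stand in for real motion data.
output = run_pipeline(
    "Move the lamp to the table",
    planner=toy_planner,
    generate_motion=lambda s: f"ref:{s.action}:{s.target}",
    track=lambda ref: ref.replace("ref", "sim", 1),
)
print(output)
```

Passing the stages as callables keeps the sketch independent of any particular language model, motion generator, or simulator; each placeholder could be swapped for a real component without changing `run_pipeline`.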

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Human Pose and Action Recognition