Human-Object Interaction from Human-Level Instructions
Zhen Wu, Jiaman Li, Pei Xu, C. Karen Liu

TL;DR
This paper presents a comprehensive system that interprets human instructions to generate realistic, physically plausible human-object interactions, including detailed finger movements, for complex environments using language models and reinforcement learning.
Contribution
It introduces the first system capable of synthesizing detailed, long-horizon human-object interactions driven by instructions, integrating language understanding with physics-based motion generation.
Findings
Successfully generates realistic human-object interactions in complex environments
Capable of detailed finger-object interaction synthesis
Ensures physical plausibility through reinforcement learning
Abstract
Intelligent agents must autonomously interact with their environments to perform daily tasks based on human-level instructions. They need a foundational understanding of the world to accurately interpret these instructions, along with precise low-level movement and interaction skills to execute the derived actions. In this work, we propose the first complete system for synthesizing physically plausible, long-horizon human-object interactions for object manipulation in contextual environments, driven by human-level instructions. We leverage large language models (LLMs) to interpret the input instructions into detailed execution plans. Unlike prior work, our system is capable of generating detailed finger-object interactions, in seamless coordination with full-body movements. We also train a policy to track generated motions in physics simulation via reinforcement learning (RL) to ensure…
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Human Pose and Action Recognition
