OpenHOI: Open-World Hand-Object Interaction Synthesis with Multimodal Large Language Model

Zhenhao Zhang; Ye Shi; Lingxiao Yang; Suting Ni; Qi Ye; Jingya Wang

arXiv:2505.18947 · cs.CV · December 23, 2025

TL;DR

OpenHOI is presented as the first framework for synthesizing realistic 3D hand-object interactions in open-world scenarios from natural language commands, combining a multimodal large language model with physics-based refinement.

Contribution

OpenHOI introduces the first open-world HOI synthesis method, integrating a multimodal LLM with physics-based refinement so that it generalizes to unseen objects and complex instructions.

Findings

01. Outperforms state-of-the-art methods in generalization to novel objects

02. Handles multi-stage tasks and complex language instructions

03. Produces physically plausible and precise interactions

Abstract

Understanding and synthesizing realistic 3D hand-object interactions (HOI) is critical for applications ranging from immersive AR/VR to dexterous robotics. Existing methods struggle with generalization, performing well on closed-set objects and predefined tasks but failing to handle unseen objects or open-vocabulary instructions. We introduce OpenHOI, the first framework for open-world HOI synthesis, capable of generating long-horizon manipulation sequences for novel objects guided by free-form language commands. Our approach integrates a 3D Multimodal Large Language Model (MLLM) fine-tuned for joint affordance grounding and semantic task decomposition, enabling precise localization of interaction regions (e.g., handles, buttons) and breakdown of complex instructions (e.g., "Find a water bottle and take a sip") into executable sub-tasks. To synthesize physically plausible interactions,…
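The task-decomposition idea in the abstract can be illustrated with a toy sketch. It assumes a hypothetical `SubTask` schema and a rule-based `decompose` helper that splits on connectives; neither comes from the paper, which uses a fine-tuned 3D MLLM for this step. The sketch only shows what "breaking an instruction into executable sub-tasks" might look like as data.

```python
import re
from dataclasses import dataclass


@dataclass
class SubTask:
    """One step of a longer instruction (hypothetical schema, not from the paper)."""
    order: int
    instruction: str


def decompose(instruction: str) -> list[SubTask]:
    """Toy stand-in for semantic task decomposition.

    Splits on commas and the connectives "and" / "then". OpenHOI performs
    this step with a fine-tuned 3D MLLM; the rule here is an assumption
    made purely for illustration.
    """
    text = instruction.strip().rstrip(".")
    parts = re.split(r"\s*(?:,|\band then\b|\bthen\b|\band\b)\s*", text)
    steps = [p for p in parts if p]  # drop empty fragments between separators
    return [SubTask(order=i, instruction=s) for i, s in enumerate(steps, start=1)]


# The example instruction quoted in the abstract.
plan = decompose("Find a water bottle and take a sip")
for step in plan:
    print(step.order, step.instruction)
```

A learned model would also need to resolve references across steps (for instance, that the sip is taken from the bottle found in the first step), which a string split cannot do.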

Videos

OpenHOI: Open-World Hand-Object Interaction Synthesis with Multimodal Large Language Model · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Multimodal Machine Learning Applications

Methods: Diffusion