Endowing GPT-4 with a Humanoid Body: Building the Bridge Between Off-the-Shelf VLMs and the Physical World
Yingzhao Jian, Zhongan Wang, Yi Yang, Hehe Fan

TL;DR
This paper introduces BiBo, a method that leverages off-the-shelf GPT-4 and vision-language models to control humanoid robots, enabling complex interactions and motions without extensive data collection.
Contribution
BiBo is a novel framework combining an embodied instruction compiler and a diffusion-based motion executor to empower VLMs for humanoid control in open environments.
Findings
Achieves 90.2% success rate in open environment interactions
Improves motion execution precision by 16.3% over previous methods
Demonstrates effective handling of diverse and complex motions
Abstract
Humanoid agents often struggle to handle flexible and diverse interactions in open environments. A common solution is to collect massive datasets to train a highly capable model, but this approach can be prohibitively expensive. In this paper, we explore an alternative solution: empowering off-the-shelf Vision-Language Models (VLMs, such as GPT-4) to control humanoid agents, thereby leveraging their strong open-world generalization to mitigate the need for extensive data collection. To this end, we present BiBo (Building humanoId agent By Off-the-shelf VLMs). It consists of two key components: (1) an embodied instruction compiler, which enables the VLM to perceive the environment and precisely translate high-level user instructions (e.g., "have a rest") into low-level primitive commands with control parameters…
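As a rough illustration of what "low-level primitive commands with control parameters" could look like in practice, the sketch below parses a structured VLM reply into validated command objects. Everything here is an assumption made for illustration: the primitive names, the `PrimitiveCommand` type, the `parse_commands` helper, and the JSON reply format are hypothetical and are not taken from the paper.

```python
import json
from dataclasses import dataclass

# Hypothetical primitive set (assumption): the paper's actual primitives
# are not listed in this summary.
ALLOWED_PRIMITIVES = {"walk_to", "sit_on", "reach"}


@dataclass(frozen=True)
class PrimitiveCommand:
    name: str     # which primitive to run
    params: dict  # control parameters, e.g. a target position or object


def parse_commands(vlm_reply: str) -> list[PrimitiveCommand]:
    """Parse a JSON reply from the VLM and reject unknown primitives."""
    commands = []
    for item in json.loads(vlm_reply):
        if item["name"] not in ALLOWED_PRIMITIVES:
            raise ValueError(f"unknown primitive: {item['name']}")
        commands.append(PrimitiveCommand(item["name"], item.get("params", {})))
    return commands


# Stand-in (assumption) for what a VLM might return for "have a rest".
example_reply = (
    '[{"name": "walk_to", "params": {"target": [1.0, 0.0, 2.5]}},'
    ' {"name": "sit_on", "params": {"object": "chair"}}]'
)
commands = parse_commands(example_reply)
```

Restricting the reply to a fixed vocabulary is one simple way to catch malformed or hallucinated output before it reaches a motion executor; whether BiBo validates commands this way is not stated in this summary.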
Peer Reviews
Decision: ICLR 2026 Poster
* The idea of utilizing the strong capabilities of a VLM to decompose tasks into primitive commands, which are then handled by a motion generator, is interesting. * The experiments show that the proposed algorithm obtains promising results on the challenging problem of interacting with open environments. * The paper is well presented and the proposed algorithm should be easy to reproduce.
* The paper relies on two components to handle the target problem. On one hand, even current SOTA VLMs may not be able to produce precise primitive actions. To simplify the problem, the paper presents a set of predefined actions, but this still cannot guarantee robust results. On the other hand, assuming VLMs can produce accurate action motions, obtaining a good motion is not a trivial task. The paper should provide more justification for why the presented motion executor can produce the
1. The idea of directly plugging an off-the-shelf VLM (GPT-4o) into a humanoid control pipeline is innovative. It avoids retraining large models by adding a lightweight compiler layer. 2. The compiler–assembler analogy is clear and intuitive: the VLM acts like a “compiler” converting high-level language into structured commands, while the motion diffusion module serves as an “assembler” for physical actuation. 3. The Latent Diffusion Model with joint decoding of executed and generated latents ensu
1. The motions shown in the video do not fully comply with physical laws. During interactions with objects, there are visible cases of hovering and penetration, which make it appear that the human keypoints are attached to the objects by rules rather than physically constrained. The interactivity seems weaker compared with methods such as UniHSI. 2. The motion generation in the video appears to depend heavily on the VLM’s outputs. However, the VLM tends to exhibit strong hallucination problems
- VLM Agent Workflow for Complex Task Understanding: The Embodied Instruction Compiler is well-designed, using a structured three-step reasoning process (attribute analysis, pose reasoning, and joint generation) to translate high-level commands into low-level motor instructions. This design allows BiBo to accurately interpret user intent and adapt to complex tasks in dynamic physical environments, such as sitting, lifting objects, or interacting with multiple scene elements. The use of voting me
- Unclear Execution of Motion with CLoSD for Dynamic Objects: The paper lacks clarity on how the generated motion trajectories are passed to CLoSD for execution. For instance, when an object moves unpredictably, does the system rely on CLoSD alone for tracking, or does it dynamically update the motion plan using feedback? What happens if the dynamic object collides with the hands? While the authors mention incorporating physical feedback into motion updates, the explanation of how BiBo handles m
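The three-step reasoning process mentioned in the reviews above (attribute analysis, pose reasoning, joint generation) can be pictured as a chain of VLM queries in which each step consumes the previous step's answer. The sketch below is only a guess at that control flow: the function `compile_instruction`, the prompt wording, and the stub `fake_vlm` are hypothetical, and the paper's actual prompts, output formats, and voting step are not reproduced here.

```python
from typing import Callable


def compile_instruction(
    instruction: str, scene: str, ask_vlm: Callable[[str], str]
) -> dict:
    """Chain three VLM queries; each step feeds on the previous answer."""
    attributes = ask_vlm(
        f"Step 1 (attribute analysis). Scene: {scene}. Instruction: {instruction}."
    )
    pose = ask_vlm(f"Step 2 (pose reasoning). Attributes: {attributes}.")
    joints = ask_vlm(f"Step 3 (joint generation). Pose: {pose}.")
    return {"attributes": attributes, "pose": pose, "joints": joints}


def fake_vlm(prompt: str) -> str:
    """Stub standing in for a real VLM call (assumption, for illustration)."""
    return f"<answer to: {prompt[:6]}>"


result = compile_instruction("have a rest", "a room with a chair", fake_vlm)
```

Passing the model in as a plain callable keeps the chain testable without network access; a real client would replace `fake_vlm`.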
Taxonomy
Topics: Social Robot Interaction and HRI · Multimodal Machine Learning Applications · Robot Manipulation and Learning
