LeVERB: Humanoid Whole-Body Control with Latent Vision-Language Instruction

Haoru Xue; Xiaoyu Huang; Dantong Niu; Qiayuan Liao; Thomas Kragerud; Jan Tommy Gravdahl; Xue Bin Peng; Guanya Shi; Trevor Darrell; Koushil Sreenath; Shankar Sastry

arXiv:2506.13751·cs.RO·September 26, 2025

3 Reviews

TL;DR

LeVERB introduces a hierarchical framework for humanoid whole-body control using vision-language instructions, enabling zero-shot task success and bridging the gap between semantic understanding and dynamic control in robotics.

Contribution

It presents the first sim-to-real benchmark for humanoid vision-language control and a novel hierarchical latent instruction-following framework called LeVERB.

Findings

01. Achieves 80% success in simple visual navigation tasks.

02. Attains 58.5% success rate overall, outperforming naive methods.

03. Demonstrates effective zero-shot generalization in humanoid control.

Abstract

Vision-language-action (VLA) models have demonstrated strong semantic understanding and zero-shot generalization, yet most existing systems assume an accurate low-level controller with hand-crafted action "vocabulary" such as end-effector pose or root velocity. This assumption confines prior work to quasi-static tasks and precludes the agile, whole-body behaviors required by humanoid whole-body control (WBC) tasks. To capture this gap in the literature, we start by introducing the first sim-to-real-ready, vision-language, closed-loop benchmark for humanoid WBC, comprising over 150 tasks from 10 categories. We then propose LeVERB: Latent Vision-Language-Encoded Robot Behavior, a hierarchical latent instruction-following framework for humanoid vision-language WBC, the first of its kind. At the top level, a vision-language policy learns a latent action vocabulary from synthetically…
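The hierarchy described in the abstract can be pictured as a two-rate loop: a slow vision-language policy emits a latent command, and a fast whole-body controller consumes the most recent latent at every control step. The following is a minimal sketch of that interface only, not the authors' implementation. All names, dimensions, the replanning interval, the example instruction, and the toy state update are hypothetical, and both policies are random stand-ins rather than trained models.

```python
import math
import random

# All sizes and rates below are hypothetical, chosen only for illustration.
LATENT_DIM = 4         # size of the latent command
STATE_DIM = 6          # size of the proprioceptive state
ACTION_DIM = 3         # number of joint targets
STEPS_PER_REPLAN = 5   # low-level steps per high-level update


def high_level_policy(image, instruction, rng):
    """Stand-in for the vision-language policy: returns a latent command.

    A real model would encode the image and the instruction; this stub
    ignores both and samples a random vector.
    """
    return [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]


def low_level_policy(state, latent, weights):
    """Stand-in for the whole-body controller: (state, latent) -> joint targets."""
    features = state + latent  # list concatenation
    return [math.tanh(sum(w * f for w, f in zip(row, features))) for row in weights]


def run_episode(num_steps, seed=0):
    """Run the two-rate loop and return (actions, number of high-level calls)."""
    rng = random.Random(seed)
    weights = [
        [rng.gauss(0.0, 0.1) for _ in range(STATE_DIM + LATENT_DIM)]
        for _ in range(ACTION_DIM)
    ]
    state = [0.0] * STATE_DIM
    latent = None
    replans = 0
    actions = []
    for t in range(num_steps):
        # The slow policy runs only every STEPS_PER_REPLAN steps;
        # between updates the last latent is held fixed.
        if t % STEPS_PER_REPLAN == 0:
            image = None  # placeholder for a camera frame
            latent = high_level_policy(image, "walk to the chair", rng)
            replans += 1
        action = low_level_policy(state, latent, weights)
        actions.append(action)
        # Toy state update: the first ACTION_DIM entries drift toward the action.
        state = [
            0.9 * s + 0.1 * (action[i] if i < ACTION_DIM else 0.0)
            for i, s in enumerate(state)
        ]
    return actions, replans


if __name__ == "__main__":
    actions, replans = run_episode(20)
    print(len(actions), "control steps,", replans, "high-level calls")
```

Holding the latent fixed between high-level updates is what keeps the expensive vision-language inference out of the real-time control loop in this sketch; the low-level step never waits on it.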

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- Proposes a **modular, hierarchical structure** (System 1 and System 2) that could improve inference efficiency and decouple vision-language reasoning from dynamics control.
- Introduces a **sim-to-real-ready benchmark** with photorealistic rendering and procedural scene randomization.
- Demonstrates **zero-shot transfer** of some whole-body behaviors from simulation to real-world deployment.

Weaknesses

- Limited Benchmark Diversity: Although the paper claims to introduce a comprehensive benchmark for humanoid WBC, the majority of tasks are navigation-like, e.g., “walk to”, “navigate around”, or “reach”. These are essentially vision-language navigation tasks, not complex whole-body manipulation. Hence, the benchmark provides limited insight into the method’s generality for rich human-object interactions (e.g., picking up, placing, or coordinated manipulation).
- Restricted Demonstration Source

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- Sim-to-real benchmark: Synthetic dataset of 150+ tasks (154 trajectories × 100 augmentations ≈ 17 h) grounded in photorealistic scenes and diversified camera views.
- A hierarchical System 2 keeps vision-language inference off the real-time control loop, thereby modularizing control between the two systems, which is clean.
- Comprehensive ablations (no discriminator, no kinematics encoder, latent sampling, etc.) clearly justify architectural choices.
- Zero-shot transfer from Isaac

Weaknesses

- Sim-to-real evidence: purely qualitative, with no task-level success rate, latency, or contact statistics; real-world failure cases are unreported.
- Missing strong baselines: recent controllers such as ExBody2 and OmniH2O, and the world-model planner Puppeteer (Hansen et al.), are not compared.
- Portability unclear: System 1 is specialized to a single Unitree G1 morphology. The modularity claim would benefit from multi-platform evidence.
- Scalability of latent vocabulary unclear: The latent space

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. While hierarchical VLA models and humanoid control exist separately, this is the first work to formulate and demonstrate vision-language-driven whole-body control for humanoids using a latent interface.
2. The learned latent verb vocabulary directly removes the key limitation of prior hierarchical VLAs, which was their reliance on an inflexible, hand-crafted "action vocabulary" (e.g., base velocities).

Weaknesses

1. The benchmark and experiments focus almost exclusively on locomotion and posture (navigation, sitting), omitting manipulation tasks (e.g., picking, pushing).
2. The celebrated "zero-shot" real-world deployment uses open-loop replay of latent plans generated in simulation. The high-level vision-language policy does not run closed-loop on the real robot, weakening the claim of a fully closed-loop system.
3. Lacks comparison to a strong, non-latent baseline (e.g., a hierarchical VLA that predicts
