TL;DR
LeVERB introduces a hierarchical framework for humanoid whole-body control using vision-language instructions, enabling zero-shot task success and bridging the gap between semantic understanding and dynamic control in robotics.
Contribution
It presents the first sim-to-real benchmark for humanoid vision-language control and a novel hierarchical latent instruction-following framework called LeVERB.
Findings
Achieves 80% success in simple visual navigation tasks.
Attains 58.5% success rate overall, outperforming naive methods.
Demonstrates effective zero-shot generalization in humanoid control.
Abstract
Vision-language-action (VLA) models have demonstrated strong semantic understanding and zero-shot generalization, yet most existing systems assume an accurate low-level controller with hand-crafted action "vocabulary" such as end-effector pose or root velocity. This assumption confines prior work to quasi-static tasks and precludes the agile, whole-body behaviors required by humanoid whole-body control (WBC) tasks. To capture this gap in the literature, we start by introducing the first sim-to-real-ready, vision-language, closed-loop benchmark for humanoid WBC, comprising over 150 tasks from 10 categories. We then propose LeVERB: Latent Vision-Language-Encoded Robot Behavior, a hierarchical latent instruction-following framework for humanoid vision-language WBC, the first of its kind. At the top level, a vision-language policy learns a latent action vocabulary from synthetically…
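The two-level structure described above, with a vision-language policy on top and a whole-body controller underneath, can be pictured as a dual-rate loop. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy policies, and the `replan_every` ratio are all hypothetical, and the only idea taken from the text is that the slow level emits a latent command which the fast level consumes at every control step.

```python
from typing import Callable, Dict, List, Sequence

Latent = List[float]
Action = List[float]


def run_hierarchy(
    high_level: Callable[[Sequence[float], str], Latent],
    low_level: Callable[[Sequence[float], Latent], Action],
    observe_image: Callable[[], Sequence[float]],
    observe_proprio: Callable[[], Sequence[float]],
    instruction: str,
    steps: int,
    replan_every: int = 5,  # hypothetical ratio of fast steps per slow step
) -> List[Action]:
    """Run the fast controller every step; refresh the latent only occasionally."""
    if replan_every < 1:
        raise ValueError("replan_every must be at least 1")
    actions: List[Action] = []
    latent: Latent = []
    for t in range(steps):
        # Slow path: vision-language inference runs only on replanning steps.
        if t % replan_every == 0:
            latent = high_level(observe_image(), instruction)
        # Fast path: the controller reuses the most recent latent command.
        actions.append(low_level(observe_proprio(), latent))
    return actions


# Toy stand-ins (assumptions for illustration only, not learned policies).
calls: Dict[str, int] = {"high": 0, "low": 0}


def toy_high_level(image: Sequence[float], instruction: str) -> Latent:
    calls["high"] += 1
    return [float(len(instruction)), float(sum(image))]


def toy_low_level(proprio: Sequence[float], latent: Latent) -> Action:
    calls["low"] += 1
    return [p + latent[0] for p in proprio]


actions = run_hierarchy(
    toy_high_level,
    toy_low_level,
    observe_image=lambda: [0.1, 0.2],
    observe_proprio=lambda: [0.0, 1.0],
    instruction="walk to the chair",
    steps=12,
    replan_every=5,
)
print(calls)
```

With these toy settings the high-level policy is queried far less often than the controller, which is the property that keeps expensive vision-language inference out of the real-time loop.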
Peer Reviews
Decision·Submitted to ICLR 2026
- Proposes a **modular, hierarchical structure** (System 1 and System 2) that could improve inference efficiency and decouple vision-language reasoning from dynamics control.
- Introduces a **sim-to-real-ready benchmark** with photorealistic rendering and procedural scene randomization.
- Demonstrates **zero-shot transfer** of some whole-body behaviors from simulation to real-world deployment.
- Limited Benchmark Diversity: Although the paper claims to introduce a comprehensive benchmark for humanoid WBC, the majority of tasks are navigation-like (e.g., “walk to”, “navigate around”, or “reach”). These are essentially vision-language navigation tasks, not complex whole-body manipulation. Hence, the benchmark provides limited insights into the method’s generality for rich human-object interactions (e.g., picking up, placing, or coordinated manipulation).
- Restricted Demonstration Source
## Strengths
- Sim-to-real benchmark: Synthetic dataset of 150+ tasks (154 trajectories × 100 augmentations ≈ 17 h) grounded in photorealistic scenes and diversified camera views.
- A hierarchical System 2 keeps vision-language inference off the real-time control loop, thereby modularizing control between the systems, which is clean.
- Comprehensive ablations (no discriminator, no kinematics encoder, latent sampling, etc.) clearly justify architectural choices.
- Zero-shot transfer from Isaac
## Weaknesses
- Sim-to-real evidence: purely qualitative, with no task-level success rate, latency, or contact statistics; real-world failure cases are unreported.
- Missing strong baselines: recent controllers such as ExBody 2 and OmniH2O, and the world-model planner Nicklas Puppeteer, are not compared.
- Portability unclear: System-1 is specialized to a single Unitree G1 morphology; the modularity claim would benefit from multi-platform evidence.
- Scalability of latent vocabulary unclear: The latent space
1. While hierarchical VLA models and humanoid control exist separately, this is the first work to formulate and demonstrate vision-language-driven whole-body control for humanoids using a latent interface.
2. The learned latent verb vocabulary directly removes the key limitation of prior hierarchical VLAs, which was their reliance on an inflexible, hand-crafted "action vocabulary" (e.g., base velocities).
1. The benchmark and experiments focus almost exclusively on locomotion and posture (navigation, sitting), omitting manipulation tasks (e.g., picking, pushing).
2. The celebrated "zero-shot" real-world deployment uses open-loop replay of latent plans generated in simulation. The high-level vision-language policy does not run closed-loop on the real robot, weakening the claim of a fully closed-loop system.
3. Lacks comparison to a strong, non-latent baseline (e.g., a hierarchical VLA that predicts
