WholeBodyVLA: Towards Unified Latent VLA for Whole-Body Loco-Manipulation Control
Haoran Jiang, Jin Chen, Qingwen Bu, Li Chen, Modi Shi, Yanjie Zhang, Delong Li, Chuanzhe Suo, Chuang Wang, Zhihui Peng, Hongyang Li

TL;DR
WholeBodyVLA introduces a unified learning framework for humanoid robots that combines vision, language, and action to improve loco-manipulation control, enabling large-space tasks with high precision and stability.
Contribution
The paper presents a novel unified latent learning framework and a loco-manipulation-oriented RL policy for humanoid robots, addressing data scarcity and execution precision issues.
Findings
Outperforms prior methods by 21.3% on AgiBot X2
Demonstrates strong generalization across tasks
Enables large-space humanoid loco-manipulation
Abstract
Humanoid robots require precise locomotion and dexterous manipulation to perform challenging loco-manipulation tasks. Yet existing approaches, modular or end-to-end, are deficient in manipulation-aware locomotion. This confines the robot to a limited workspace, preventing it from performing large-space loco-manipulation. We attribute this to: (1) the challenge of acquiring loco-manipulation knowledge due to the scarcity of humanoid teleoperation data, and (2) the difficulty of faithfully and reliably executing locomotion commands, stemming from the limited precision and stability of existing RL controllers. To acquire richer loco-manipulation knowledge, we propose a unified latent learning framework that enables a Vision-Language-Action (VLA) system to learn from low-cost action-free egocentric videos. Moreover, an efficient human data collection pipeline is devised to augment the dataset…
Peer Reviews
Decision: ICLR 2026 Poster
**Originality** WholeBodyVLA introduces a unified vision-language-action framework for humanoid loco-manipulation, extending latent action learning from tabletop manipulation to full-body coordination. The work’s key novelty lies in the dual-LAM design, which explicitly separates locomotion and manipulation latent spaces to mitigate conflicts between camera ego-motion and hand–object motion—an insightful adaptation that broadens latent learning’s applicability. Additionally, the proposed Loco-M
**Insufficient Evidence for “Unified” Whole-Body Control** Despite the “unified” claim, the system decouples upper-body manipulation and lower-body locomotion: the VLA predicts separate manipulation and locomotion latents, a lightweight decoder outputs arm joint targets and a discrete locomotion command, and the LMO policy executes lower-body torques. This constitutes co-scheduled rather than end-to-end joint whole-body control, relying on downstream modules to handle cross-coupling. **Limited
1. Clear paper writing and problem formulation. The paper identifies the gap between modular pipelines and true end-to-end manipulation-aware locomotion, motivating why coupling is necessary for stability and task success. 2. Unified latent learning with modality-separated LAMs. The paper identifies the sub-optimality of using mixed data to train a single LAM and proposes training separate LAMs for manipulation and locomotion. 2. Separation of latent action spaces. Training distinct latent action models f
1. Ambiguity in loco-manipulation demands of the tasks. In all three demos, target objects appear within immediate arm’s reach at the start, so grasping could be completed without meaningful locomotion. Even though the side-stepping motion is stable, it is still difficult to determine whether improvements arise from integrated loco-manipulation versus decoupled manipulation with stepping. 2. Baseline fairness. Some baselines were not evidently fine-tuned on the same AgiBot/LMO data or adapted t
1. The paper clearly identifies the missing link between manipulation-aware locomotion and VLA-based manipulation, proposing the first end-to-end unified framework that integrates both in real-world humanoid control. 2. The discrete command interface + structured perturbation RL is elegantly engineered. The two-stage curriculum and well-defined reward shaping (directional accuracy, stand-still penalty) demonstrate deep insight into locomotion stability and precision. 3. Evaluation on multiple re
1. The upper-body movements demonstrated in the three tasks appear quite limited — the shoulder seems to move primarily along the pitch axis. However, since the tabletop objects can be positioned in more diverse locations, introducing richer interaction motions could better showcase the capabilities of the VLA. 2. The current tabletop task is rather simple and singular. Did the authors also collect additional related data that could be used to supplement or expand the current manipulation tasks?
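The data flow that the reviews describe (separate manipulation and locomotion latents, a lightweight decoder producing arm joint targets plus a discrete locomotion command, and an RL policy turning that command into lower-body torques) can be sketched roughly as follows. This is only an illustrative toy under stated assumptions, not the authors' implementation: every function body is a placeholder, and the names `LocoCommand`, `vla_predict`, `decode`, `lmo_policy`, and `control_step`, the command set, and the joint counts are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Sequence, Tuple


class LocoCommand(Enum):
    """Hypothetical discrete locomotion command set (not taken from the paper)."""
    STAND = 0
    FORWARD = 1
    BACKWARD = 2
    SIDESTEP_LEFT = 3
    SIDESTEP_RIGHT = 4


@dataclass
class Latents:
    manipulation: List[float]  # latent for upper-body manipulation
    locomotion: List[float]    # latent for lower-body locomotion


def vla_predict(image_feat: Sequence[float], instruction: str) -> Latents:
    """Placeholder for the VLA: emits two separate latents.

    Toy stand-in: splits the feature vector in half and ignores the
    instruction. Assumes len(image_feat) >= 2.
    """
    mid = len(image_feat) // 2
    return Latents(list(image_feat[:mid]), list(image_feat[mid:]))


def decode(latents: Latents, num_arm_joints: int = 14) -> Tuple[List[float], LocoCommand]:
    """Placeholder decoder: arm joint targets plus one discrete command."""
    # Toy arm targets: broadcast the mean of the manipulation latent.
    mean = sum(latents.manipulation) / len(latents.manipulation)
    arm_targets = [mean] * num_arm_joints
    # Toy command: index of the largest locomotion latent entry,
    # wrapped into the size of the command set.
    idx = max(range(len(latents.locomotion)), key=lambda i: latents.locomotion[i])
    command = list(LocoCommand)[idx % len(LocoCommand)]
    return arm_targets, command


def lmo_policy(command: LocoCommand, num_leg_joints: int = 12) -> List[float]:
    """Placeholder RL policy: maps a discrete command to lower-body torques."""
    scale = 0.0 if command is LocoCommand.STAND else 1.0
    return [scale] * num_leg_joints


def control_step(image_feat: Sequence[float], instruction: str):
    """One control step: VLA -> decoder -> (arm targets, RL-executed legs)."""
    latents = vla_predict(image_feat, instruction)
    arm_targets, command = decode(latents)
    leg_torques = lmo_policy(command)
    return arm_targets, command, leg_torques


if __name__ == "__main__":
    arm, cmd, legs = control_step([0.0, 0.0, 0.1, 0.9], "walk to the table")
    print(cmd, len(arm), len(legs))
```

In this sketch the upper and lower body meet only at the discrete command, which is the structure the second review calls co-scheduled rather than jointly controlled.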
Taxonomy
Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Human Motion and Animation
