ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation
Xialin He, Sirui Xu, Xinyao Li, Runpei Dong, Liuyu Bian, Yu-Xiong Wang, Liang-Yan Gui

TL;DR
ULTRA is a unified framework that lets humanoid robots perform autonomous, goal-directed whole-body loco-manipulation from perception and high-level commands. It targets three limitations of prior work: scarce or low-quality retargeted data, poor scaling to large skill repertoires, and reliance on tracking predefined motion references.
Contribution
The paper introduces ULTRA, which combines a physics-driven neural retargeting algorithm with a unified multimodal controller that generates behavior from perception and high-level task specifications, advancing humanoid loco-manipulation.
Findings
ULTRA outperforms tracking-only baselines in simulation and real-world tests.
It generalizes to goal-conditioned tasks from egocentric perception.
ULTRA maintains robustness under out-of-distribution scenarios.
Abstract
Achieving autonomous and versatile whole-body loco-manipulation remains a central barrier to making humanoids practically useful. Yet existing approaches are fundamentally constrained: retargeted data are often scarce or low-quality; methods struggle to scale to large skill repertoires; and, most importantly, they rely on tracking predefined motion references rather than generating behavior from perception and high-level task specifications. To address these limitations, we propose ULTRA, a unified framework with two key components. First, we introduce a physics-driven neural retargeting algorithm that translates large-scale motion capture to humanoid embodiments while preserving physical plausibility for contact-rich interactions. Second, we learn a unified multimodal controller that supports both dense references and sparse task specifications, under sensing ranging from accurate…
Taxonomy
Topics: Human Motion and Animation · Robotic Locomotion and Control · Robot Manipulation and Learning
