VIRAL: Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation

Tairan He; Zi Wang; Haoru Xue; Qingwei Ben; Zhengyi Luo; Wenli Xiao; Ye Yuan; Xingye Da; Fernando Castañeda; Shankar Sastry; Changliu Liu; Guanya Shi; Linxi Fan; and Yuke Zhu

arXiv:2511.15200 · cs.RO · December 1, 2025

Open Access

TL;DR

VIRAL is a scalable visual sim-to-real framework enabling humanoid robots to learn loco-manipulation skills entirely in simulation and deploy them zero-shot in real-world scenarios, achieving expert-level performance.

Contribution

The paper introduces VIRAL, a novel large-scale visual sim-to-real approach with a teacher-student design, enabling zero-shot transfer of humanoid loco-manipulation skills.

Findings

01 Scaling simulation to tens of GPUs (up to 64) makes both teacher and student training reliable; low-compute regimes often fail.

02 VIRAL achieves up to 54 continuous loco-manipulation cycles in real-world deployment.

03 The RGB-based policy generalizes across diverse environments without fine-tuning.

Abstract

A key barrier to the real-world deployment of humanoid robots is the lack of autonomous loco-manipulation skills. We introduce VIRAL, a visual sim-to-real framework that learns humanoid loco-manipulation entirely in simulation and deploys it zero-shot to real hardware. VIRAL follows a teacher-student design: a privileged RL teacher, operating on full state, learns long-horizon loco-manipulation using a delta action space and reference state initialization. A vision-based student policy is then distilled from the teacher via large-scale simulation with tiled rendering, trained with a mixture of online DAgger and behavior cloning. We find that compute scale is critical: scaling simulation to tens of GPUs (up to 64) makes both teacher and student training reliable, while low-compute regimes often fail. To bridge the sim-to-real gap, VIRAL combines large-scale visual domain randomization…
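The abstract names two mechanisms that a small sketch can make concrete: a delta action space for the teacher, and a mixture of online DAgger and behavior cloning for distilling the student. The Python below is a minimal illustration under stated assumptions, not the authors' implementation. It assumes that a delta action is a scaled offset added to the current joint position and clipped to joint limits, and that the DAgger/behavior-cloning mixture is realized by letting the student act in some fraction of parallel environments while the teacher acts in the rest. All function and parameter names (`delta_action_target`, `choose_actors`, `collect_labels`, `student_fraction`) are hypothetical.

```python
import random


def delta_action_target(current_pos, action, scale, lower, upper):
    """Assumed delta action space: each action offsets the current joint
    position, and the result is clipped to the joint limits."""
    return [
        min(max(q + scale * a, lo), hi)
        for q, a, lo, hi in zip(current_pos, action, lower, upper)
    ]


def choose_actors(num_envs, student_fraction, rng):
    """Pick, per environment, who drives the rollout.

    'student' rollouts give DAgger-style data (states the student visits);
    'teacher' rollouts give behavior-cloning data (states the teacher visits).
    The per-environment split is an assumption made for this sketch.
    """
    return [
        "student" if rng.random() < student_fraction else "teacher"
        for _ in range(num_envs)
    ]


def collect_labels(observations, privileged_states, teacher, student, actors):
    """Return (executed_actions, supervised_targets) for one step.

    The supervised target is always the teacher's action on the privileged
    state; only the executed action depends on who drives the environment.
    """
    executed, targets = [], []
    for obs, state, actor in zip(observations, privileged_states, actors):
        teacher_action = teacher(state)
        executed.append(student(obs) if actor == "student" else teacher_action)
        targets.append(teacher_action)
    return executed, targets


if __name__ == "__main__":
    rng = random.Random(0)
    actors = choose_actors(num_envs=8, student_fraction=0.5, rng=rng)
    # Toy stand-ins for the policies: the teacher sees the state, the student an image-like value.
    executed, targets = collect_labels(
        observations=[0.1] * 8,
        privileged_states=[1.0] * 8,
        teacher=lambda s: [s],
        student=lambda o: [2 * o],
        actors=actors,
    )
    print(actors)
    print(delta_action_target([0.0, 1.0], [1.0, 1.0], 0.5, [-1.0, -1.0], [1.0, 1.2]))
```

In this sketch, setting `student_fraction` to 0 reduces data collection to pure behavior cloning, and setting it to 1 gives pure online DAgger. The ratio and schedule actually used in VIRAL are not stated in this excerpt.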

Taxonomy

Topics: Robot Manipulation and Learning · Robotic Locomotion and Control · Social Robot Interaction and HRI