Now You See That: Learning End-to-End Humanoid Locomotion from Raw Pixels
Wandong Sun, Yongbo Su, Leoric Huang, Alex Zhang, Dwyane Wei, Mu San, Daniel Tian, Ellie Cao, Baoshi Cao, Yang Liu, Finn Yan, Ethan Xie, Zongwu Xie

TL;DR
This paper introduces an end-to-end vision-based humanoid locomotion framework that improves sim-to-real transfer and terrain adaptation using high-fidelity simulation, behavior distillation, and multi-critic learning.
Contribution
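It presents methods for sim-to-real transfer (a high-fidelity depth sensor simulation and vision-aware behavior distillation from privileged height maps to noisy depth observations) and for terrain-specific policy learning in vision-based humanoid locomotion.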
Findings
Robust policy performance across diverse terrains and challenges.
Effective sim-to-real transfer with high-fidelity depth sensor simulation.
Successful bidirectional staircase traversal and gap crossing.
Abstract
Achieving robust vision-based humanoid locomotion remains challenging due to two fundamental issues: the sim-to-real gap introduces significant perception noise that degrades performance on fine-grained tasks, and training a unified policy across diverse terrains is hindered by conflicting learning objectives. To address these challenges, we present an end-to-end framework for vision-driven humanoid locomotion. For robust sim-to-real transfer, we develop a high-fidelity depth sensor simulation that captures stereo matching artifacts and calibration uncertainties inherent in real-world sensing. We further propose a vision-aware behavior distillation approach that combines latent space alignment with noise-invariant auxiliary tasks, enabling effective knowledge transfer from privileged height maps to noisy depth observations. For versatile terrain adaptation, we introduce terrain-specific…
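To make the depth sensor simulation idea from the abstract concrete, here is a minimal sketch of how stereo-style noise might be injected into a clean simulated depth image. It is not the authors' implementation. The function name `simulate_depth_noise` and every parameter value (focal length, baseline, disparity noise, dropout rate, maximum range) are assumptions chosen for illustration. The sketch relies on the standard stereo relation z = f·b/d, which implies that depth error grows roughly with z², and it models failed stereo matches as random invalid pixels. Calibration uncertainty is not modeled here.

```python
import numpy as np


def simulate_depth_noise(depth, rng, disparity_sigma=0.1, focal_px=400.0,
                         baseline_m=0.05, dropout_prob=0.02, max_range_m=5.0):
    """Return a noisy copy of a clean depth image (meters); 0.0 marks invalid pixels.

    All default values are hypothetical and not taken from the paper.
    """
    depth = np.asarray(depth, dtype=np.float64)

    # Stereo depth is z = f * b / d, so a disparity error dd maps to a
    # depth error of about z^2 / (f * b) * dd: far pixels are noisier.
    sigma_z = depth ** 2 / (focal_px * baseline_m) * disparity_sigma
    noisy = depth + rng.normal(0.0, 1.0, size=depth.shape) * sigma_z

    # Failed stereo matches: drop random pixels to the invalid value.
    holes = rng.random(depth.shape) < dropout_prob
    noisy[holes] = 0.0

    # Readings outside the sensor's working range are also reported as invalid.
    noisy[(noisy <= 0.0) | (noisy > max_range_m)] = 0.0
    return noisy


rng = np.random.default_rng(0)
clean = np.full((48, 64), 2.0)           # flat wall 2 m from the camera
noisy = simulate_depth_noise(clean, rng)
valid = noisy[noisy > 0.0]
```

A policy trained on images produced this way sees holes and range-dependent jitter during training, which is the general motivation the abstract gives for modeling sensor artifacts in simulation.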
