Now You See That: Learning End-to-End Humanoid Locomotion from Raw Pixels

Wandong Sun; Yongbo Su; Leoric Huang; Alex Zhang; Dwyane Wei; Mu San; Daniel Tian; Ellie Cao; Baoshi Cao; Yang Liu; Finn Yan; Ethan Xie; Zongwu Xie

arXiv:2602.06382·cs.RO·May 12, 2026


TL;DR

This paper introduces an end-to-end vision-based humanoid locomotion framework that improves sim-to-real transfer and terrain adaptation using high-fidelity simulation, behavior distillation, and multi-critic learning.

Contribution

The paper contributes methods for sim-to-real transfer (high-fidelity depth sensor simulation and vision-aware behavior distillation) and for terrain-specific policy learning in vision-based humanoid locomotion.

Findings

01 Robust policy performance across diverse terrains and challenges.

02 Effective sim-to-real transfer with high-fidelity depth sensor simulation.

03 Successful bidirectional staircase traversal and gap crossing.

Abstract

Achieving robust vision-based humanoid locomotion remains challenging due to two fundamental issues: the sim-to-real gap introduces significant perception noise that degrades performance on fine-grained tasks, and training a unified policy across diverse terrains is hindered by conflicting learning objectives. To address these challenges, we present an end-to-end framework for vision-driven humanoid locomotion. For robust sim-to-real transfer, we develop a high-fidelity depth sensor simulation that captures stereo matching artifacts and calibration uncertainties inherent in real-world sensing. We further propose a vision-aware behavior distillation approach that combines latent space alignment with noise-invariant auxiliary tasks, enabling effective knowledge transfer from privileged height maps to noisy depth observations. For versatile terrain adaptation, we introduce terrain-specific…
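
To make the idea of simulating stereo depth artifacts concrete, here is a minimal sketch. It is not the paper's sensor model: the function name `simulate_stereo_depth`, every parameter, and all default values are assumptions chosen for illustration. It relies only on the standard stereo relation depth = focal length × baseline / disparity, so noise added in disparity space produces depth errors that grow with distance, and it adds random dropout holes and a small calibration scale error.

```python
import numpy as np

def simulate_stereo_depth(depth_m, focal_px=400.0, baseline_m=0.05,
                          disparity_sigma_px=0.25, subpixel_step=0.125,
                          calib_sigma=0.01, dropout_prob=0.02,
                          max_range_m=5.0, rng=None):
    """Corrupt a clean depth image (meters) with stereo-like artifacts.

    All parameters are hypothetical illustration values, not taken from
    the paper. Invalid pixels are reported as 0.0.
    """
    rng = np.random.default_rng() if rng is None else rng
    depth = np.clip(np.asarray(depth_m, dtype=np.float64), 1e-3, None)

    # Convert depth to disparity: d = f * B / z.
    disparity = focal_px * baseline_m / depth

    # Matching noise and subpixel quantization act in disparity space.
    disparity = disparity + rng.normal(0.0, disparity_sigma_px, size=depth.shape)
    disparity = np.round(disparity / subpixel_step) * subpixel_step

    # One calibration scale error per image (assumed focal/baseline miscalibration).
    scale = 1.0 + rng.normal(0.0, calib_sigma)

    # Convert back to depth; non-positive disparities are invalid.
    noisy = np.zeros_like(depth)
    valid = disparity > 0
    noisy[valid] = focal_px * baseline_m * scale / disparity[valid]

    # Random holes, a crude stand-in for failed stereo matches.
    holes = rng.random(depth.shape) < dropout_prob
    noisy[holes] = 0.0

    # Readings beyond the sensor range are also marked invalid.
    noisy[noisy > max_range_m] = 0.0
    return noisy
```

In a training loop, a function like this would typically be applied to rendered depth frames before they reach the policy, with its parameters randomized per episode so the policy does not overfit to one noise level.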
