GeoLoco: Leveraging 3D Geometric Priors from Visual Foundation Model for Robust RGB-Only Humanoid Locomotion

Yufei Liu; Xieyuanli Chen; Hainan Pan; Chenghao Shi; Yanjie Chen; Kaihong Huang; Zhiwen Zeng; Huimin Lu

arXiv:2603.07624·cs.RO·March 10, 2026

TL;DR

GeoLoco introduces a novel RGB-only humanoid locomotion method that leverages 3D geometric priors from a frozen visual foundation model, enabling robust zero-shot sim-to-real transfer without depth sensors.

Contribution

The paper proposes a scale-aware 3D latent representation, derived from monocular images by a frozen visual foundation model, together with a cross-attention mechanism for humanoid locomotion.
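As a rough illustration of what a cross-attention mechanism over visual features can look like, the sketch below lets a single query vector attend over a set of image tokens. It is a generic single-head cross-attention in NumPy, not the authors' architecture: the choice of the proprioceptive state as the query, the token count, all dimensions, and the random weights and inputs are assumptions made only for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, tokens, w_q, w_k, w_v):
    """Single-head cross-attention: one query vector attends over N tokens."""
    q = query @ w_q                          # (d,)
    k = tokens @ w_k                         # (N, d)
    v = tokens @ w_v                         # (N, d)
    scores = k @ q / np.sqrt(q.shape[-1])    # (N,) scaled dot-product scores
    weights = softmax(scores)                # (N,) non-negative, sums to 1
    return weights @ v, weights              # fused feature (d,), attention (N,)


rng = np.random.default_rng(0)

# Hypothetical inputs: random stand-ins, not real robot or model outputs.
proprio = rng.normal(size=48)                # assumed proprioceptive state
vfm_tokens = rng.normal(size=(196, 384))     # assumed frozen-VFM patch features

d = 64                                       # assumed attention width
w_q = rng.normal(size=(48, d)) * 0.1         # these projections would be learned
w_k = rng.normal(size=(384, d)) * 0.1
w_v = rng.normal(size=(384, d)) * 0.1

fused, attn = cross_attention(proprio, vfm_tokens, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

In a setup like this only the projection matrices would be trained, while the features in `vfm_tokens` come from a model whose weights stay fixed, which matches the "frozen" wording in the summary. How the paper actually builds its queries and keys is not stated in the text available here.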

Findings

01. Achieves robust zero-shot transfer to real humanoid robots

02. Successfully negotiates challenging terrains in real-world tests

03. Outperforms depth-based methods in certain scenarios

Abstract

The prevailing paradigm of perceptive humanoid locomotion relies heavily on active depth sensors. However, this depth-centric approach fundamentally discards the rich semantic and dense appearance cues of the visual world, severing low-level control from the high-level reasoning essential for general embodied intelligence. While monocular RGB offers a ubiquitous, information-dense alternative, end-to-end reinforcement learning from raw 2D pixels suffers from extreme sample inefficiency and catastrophic sim-to-real collapse due to the inherent loss of geometric scale. To break this deadlock, we propose GeoLoco, a purely RGB-driven locomotion framework that conceptualizes monocular images as high-dimensional 3D latent representations by harnessing the powerful geometric priors of a frozen, scale-aware Visual Foundation Model (VFM). Rather than naive feature concatenation, we design a…

Taxonomy

Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Human Motion and Animation