CReF: Cross-modal and Recurrent Fusion for Depth-conditioned Humanoid Locomotion

Yuan Hao; Ruiqi Yu; Shixin Luo; Guoteng Zhang; Jun Wu; Qiuguo Zhu

arXiv:2603.29452 · cs.RO · April 2, 2026

TL;DR

CReF is a novel depth-conditioned humanoid locomotion framework that learns directly from raw depth data, enabling robust traversal over complex terrains without relying on explicit geometric representations.

Contribution

CReF introduces a single-stage fusion method that combines proprioception-queried cross-modal attention with recurrent integration, aiming to improve terrain interaction and zero-shot transfer in humanoid locomotion.

Findings

01 Demonstrates robust traversal over diverse terrains in simulation and real-world scenarios.

02 Achieves effective zero-shot transfer to complex outdoor environments.

03 Outperforms prior methods that rely on explicit geometric abstractions.

Abstract

Stable traversal over geometrically complex terrain increasingly requires exteroceptive perception, yet prior perceptive humanoid locomotion methods often remain tied to explicit geometric abstractions, either by mediating control through robot-centric 2.5D terrain representations or by shaping depth learning with auxiliary geometry-related targets. Such designs inherit the representational bias of the intermediate or supervisory target and can be restrictive for vertical structures, perforated obstacles, and complex real-world clutter. We propose CReF (Cross-modal and Recurrent Fusion), a single-stage depth-conditioned humanoid locomotion framework that learns locomotion-relevant features directly from raw forward-facing depth without explicit geometric intermediates. CReF couples proprioception and depth tokens through proprioception-queried cross-modal attention, fuses the resulting…
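The proprioception-queried cross-modal attention named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the single attention head, the random projection matrices `w_q`, `w_k`, `w_v`, and all dimensions are assumptions chosen only to show the general pattern, in which a proprioceptive vector forms the query and depth tokens supply the keys and values.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def cross_attention(proprio, depth_tokens, w_q, w_k, w_v):
    """Single-head attention: proprioception is the query,
    depth tokens are the keys and values (hypothetical layout)."""
    q = matvec(w_q, proprio)
    keys = [matvec(w_k, t) for t in depth_tokens]
    values = [matvec(w_v, t) for t in depth_tokens]
    scale = math.sqrt(len(q))
    # One scaled dot-product score per depth token.
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    weights = softmax(scores)
    # Weighted sum of value vectors gives the depth summary for this step.
    fused = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return fused, weights


def random_matrix(rows, cols, rng):
    """Random projection standing in for learned weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Illustrative sizes only; none of these come from the paper.
PROPRIO_DIM, TOKEN_DIM, ATTN_DIM, NUM_TOKENS = 6, 8, 4, 5

rng = random.Random(0)
proprio = [rng.uniform(-1, 1) for _ in range(PROPRIO_DIM)]
depth_tokens = [
    [rng.uniform(-1, 1) for _ in range(TOKEN_DIM)] for _ in range(NUM_TOKENS)
]
w_q = random_matrix(ATTN_DIM, PROPRIO_DIM, rng)
w_k = random_matrix(ATTN_DIM, TOKEN_DIM, rng)
w_v = random_matrix(ATTN_DIM, TOKEN_DIM, rng)

fused, weights = cross_attention(proprio, depth_tokens, w_q, w_k, w_v)
```

Here `weights` holds one attention weight per depth token and `fused` is the attention-weighted depth summary for a single control step. In a trained policy the projections would be learned, and how the result is combined with a recurrent state is left unspecified here because the abstract is truncated at that point.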
