Empowering Dynamic Urban Navigation with Stereo and Mid-Level Vision

Wentao Zhou; Xuweiyi Chen; Vignesh Rajagopal; Jeffrey Chen; Rohan Chandra; Zezhou Cheng

arXiv:2512.10956 · cs.CV · December 12, 2025

TL;DR

This paper introduces StereoWalker, a navigation model that combines stereo vision and mid-level vision modules to improve dynamic urban navigation, reducing data requirements and outperforming monocular models.

Contribution

The paper presents StereoWalker, which integrates stereo inputs and mid-level vision to improve navigation accuracy and efficiency, especially in dynamic environments, and introduces a new stereo navigation dataset.

Findings

01. Mid-level vision improves navigation performance significantly.

02. Stereo vision outperforms monocular input in dynamic scenes.

03. StereoWalker achieves comparable results with much less training data.

Abstract

The success of foundation models in language and vision has motivated research into fully end-to-end robot navigation foundation models (NFMs). NFMs map monocular visual input directly to control actions and ignore mid-level vision modules (tracking, depth estimation, etc.) entirely. While the assumption that vision capabilities will emerge implicitly is compelling, it requires large amounts of pixel-to-action supervision that are difficult to obtain. The challenge is especially pronounced in dynamic and unstructured settings, where robust navigation requires precise geometric and dynamic understanding, and the depth-scale ambiguity in monocular views further limits accurate spatial reasoning. In this paper, we show that relying on monocular vision and ignoring mid-level vision priors is inefficient. We present StereoWalker, which augments NFMs with stereo inputs and explicit mid-level…
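The depth-scale ambiguity mentioned in the abstract can be illustrated with textbook pinhole and rectified-stereo geometry. The sketch below is not StereoWalker's method; the focal length, baseline, and point coordinates are hypothetical values chosen only for illustration. It shows that two points along the same viewing ray project to the same pixel in one camera, while the disparity between two horizontally offset cameras separates them via Z = f * B / d.

```python
# Illustrative only: standard pinhole / rectified-stereo geometry,
# not code from the paper. All numeric values are hypothetical.

def project(point, focal_px):
    """Pinhole projection of a 3D point (X, Y, Z) to pixel offsets (u, v)."""
    x, y, z = point
    return (focal_px * x / z, focal_px * y / z)

def disparity(point, focal_px, baseline_m):
    """Horizontal pixel shift between the left and right views of a point.

    The right camera is assumed to sit baseline_m to the right of the
    left camera, with parallel optical axes (a rectified pair).
    """
    x, y, z = point
    u_left, _ = project((x, y, z), focal_px)
    u_right, _ = project((x - baseline_m, y, z), focal_px)
    return u_left - u_right

def stereo_depth(disparity_px, focal_px, baseline_m):
    """Depth from a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

focal_px = 700.0    # hypothetical focal length in pixels
baseline_m = 0.12   # hypothetical camera separation in metres

near = (0.5, 0.25, 2.0)   # a point 2 m away
far = (1.5, 0.75, 6.0)    # same viewing ray, three times farther

# Monocular: both points land on the same pixel, so depth is ambiguous.
same_pixel = project(near, focal_px) == project(far, focal_px)

# Stereo: the disparities differ, and each one recovers metric depth.
depth_near = stereo_depth(disparity(near, focal_px, baseline_m), focal_px, baseline_m)
depth_far = stereo_depth(disparity(far, focal_px, baseline_m), focal_px, baseline_m)
```

Here `same_pixel` is true for the monocular view, while `depth_near` and `depth_far` come out close to 2 m and 6 m respectively, which is the kind of metric cue a single image cannot provide on its own.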

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Human Pose and Action Recognition