SpatialAnt: Autonomous Zero-Shot Robot Navigation via Active Scene Reconstruction and Visual Anticipation

Jiwen Zhang; Xiangyu Shi; Siyuan Wang; Zerui Li; Zhongyu Wei; Qi Wu

arXiv:2603.26837 · cs.RO · March 31, 2026

TL;DR

SpatialAnt is a zero-shot robot navigation framework that combines active scene reconstruction with visual anticipation to improve navigation in unseen environments, even when the robot's self-built reconstructions are noisy.

Contribution

SpatialAnt introduces a physical grounding strategy and a visual anticipation mechanism to improve zero-shot navigation when scene reconstructions are imperfect.

Findings

01

Achieved a 66% success rate on the R2R-CE benchmark.

02

Achieved a 50.8% success rate on the RxR-CE benchmark.

03

Deployed on a real robot with a 52% success rate.

Abstract

Vision-and-Language Navigation (VLN) has recently benefited from Multimodal Large Language Models (MLLMs), enabling zero-shot navigation. While recent exploration-based zero-shot methods have shown promising results by leveraging global scene priors, they rely on high-quality human-crafted scene reconstructions, which are impractical for real-world robot deployment. When encountering an unseen environment, a robot should build its own priors through pre-exploration. However, these self-built reconstructions are inevitably incomplete and noisy, which severely degrades methods that depend on high-quality scene reconstructions. To address these issues, we propose SpatialAnt, a zero-shot navigation framework designed to bridge the gap between imperfect self-reconstructions and robust execution. SpatialAnt introduces a physical grounding strategy to recover the absolute metric scale for…
