UrbanVLA: A Vision-Language-Action Model for Urban Micromobility
Anqi Li, Zhiyong Wang, Jiazhao Zhang, Minghan Li, Yunpeng Qi, Zhibo Chen, Zhizheng Zhang, He Wang

TL;DR
UrbanVLA is a scalable vision-language-action model for reliable urban micromobility navigation. It aligns route waypoints with visual observations and is trained with a two-stage pipeline, surpassing baselines by over 55% in SocialNav tasks.
Contribution
The paper introduces UrbanVLA, a novel route-conditioned VLA framework with a two-stage training process for scalable and robust urban navigation in micromobility applications.
Findings
UrbanVLA surpasses baselines by over 55% in SocialNav tasks.
It achieves reliable real-world navigation in large-scale urban environments.
The model demonstrates robustness against real-world uncertainties.
Abstract
Urban micromobility applications, such as delivery robots, demand reliable navigation across large-scale urban environments while following long-horizon route instructions. This task is particularly challenging due to the dynamic and unstructured nature of real-world city areas, yet most existing navigation methods remain tailored to short-scale and controllable scenarios. Effective urban micromobility requires two complementary levels of navigation skills: low-level capabilities, such as point-goal reaching and obstacle avoidance, and high-level capabilities, such as route-visual alignment. To this end, we propose UrbanVLA, a route-conditioned Vision-Language-Action (VLA) framework designed for scalable urban navigation. Our method explicitly aligns noisy route waypoints with visual observations during execution, and subsequently plans trajectories to drive the robot. To enable UrbanVLA…
