CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos
Xinhao Liu, Jintong Li, Yicheng Jiang, Niranjan Sujay, Zhicheng Yang, Juexiao Zhang, John Abanes, Jing Zhang, Chen Feng

TL;DR
CityWalker leverages web-scale urban videos to train embodied agents for complex navigation tasks, significantly improving performance in dynamic city environments without relying on costly annotations.
Contribution
The paper introduces a scalable data-driven approach using web videos for training urban navigation agents, enabling large-scale imitation learning without manual annotations.
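As a rough illustration of how action supervision might be derived from unlabeled video, the sketch below turns a per-frame camera trajectory into future waypoints expressed in the agent's own frame, which could then serve as imitation-learning targets. This is not the paper's pipeline: the input format (planar poses `[x, y, yaw]`, assumed to come from some off-the-shelf visual odometry tool), the function name `relative_waypoints`, and the `horizon` parameter are all hypothetical choices made for this example.

```python
import numpy as np


def relative_waypoints(poses, horizon=5):
    """Convert world-frame poses into local-frame future waypoints.

    poses:   (T, 3) array of [x, y, yaw] per frame (assumed input format).
    horizon: number of future frames used as the action label (hypothetical).
    Returns an array of shape (T - horizon, horizon, 2).
    """
    poses = np.asarray(poses, dtype=float)
    num_frames = poses.shape[0]
    if num_frames <= horizon:
        return np.empty((0, horizon, 2))

    labels = np.empty((num_frames - horizon, horizon, 2))
    for t in range(num_frames - horizon):
        x, y, yaw = poses[t]
        c, s = np.cos(yaw), np.sin(yaw)
        # Rotation by -yaw maps world-frame offsets into the agent's frame,
        # so that local +x points in the agent's heading direction.
        world_to_local = np.array([[c, s], [-s, c]])
        offsets = poses[t + 1 : t + 1 + horizon, :2] - np.array([x, y])
        labels[t] = offsets @ world_to_local.T
    return labels


# Usage: an agent heading along world +y and moving one unit per frame.
straight = [[0.0, float(i), np.pi / 2] for i in range(4)]
waypoints = relative_waypoints(straight, horizon=2)
```

In this toy trajectory the agent walks straight ahead, so every label lies on the local forward axis. Real odometry would be noisy and defined only up to scale, which a practical pipeline would have to handle.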
Findings
Training on large-scale diverse datasets improves navigation performance.
The approach surpasses existing methods in urban navigation tasks.
Web-sourced videos effectively teach complex navigation behaviors.
Abstract
Navigating dynamic urban environments presents significant challenges for embodied agents, requiring advanced spatial reasoning and adherence to common-sense norms. Despite progress, existing visual navigation methods struggle in map-free or off-street settings, limiting the deployment of autonomous agents like last-mile delivery robots. To overcome these obstacles, we propose a scalable, data-driven approach for human-like urban navigation by training agents on thousands of hours of in-the-wild city walking and driving videos sourced from the web. We introduce a simple and scalable data processing pipeline that extracts action supervision from these videos, enabling large-scale imitation learning without costly annotations. Our model learns sophisticated navigation policies to handle diverse challenges and critical scenarios. Experimental results show that training on large-scale,…
Taxonomy
Topics: Geographic Information Systems Studies · Human Mobility and Location-Based Analysis · Video Analysis and Summarization
