CitySeeker: How Do VLMs Explore Embodied Urban Navigation With Implicit Human Needs?
Siqi Wang, Chao Liang, Yunfan Gao, Erxin Yu, Sen Li, Yushi Li, Jing Li, Haofen Wang

TL;DR
CitySeeker introduces a new benchmark to evaluate vision-language models' ability to understand and act on implicit human needs in urban navigation, revealing current limitations and proposing strategies for improvement.
Contribution
This work presents CitySeeker, a comprehensive benchmark for embodied urban navigation based on implicit needs, and analyzes the challenges and potential strategies to enhance VLMs' spatial reasoning.
Findings
Top models achieve only 21.1% task completion.
Error accumulation hampers long-horizon reasoning.
Strategies like backtracking and memory retrieval improve performance.
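To make the backtracking idea concrete, here is a minimal sketch of a greedy walk over a street graph that retreats one step when it reaches a dead end. It is not the paper's implementation: the graph, the `score` function (standing in for a VLM's judgment of how promising a neighboring view looks), and the name `navigate_with_backtracking` are all hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical street graph: node id -> ids of adjacent nodes.
Graph = Dict[str, List[str]]


def navigate_with_backtracking(
    graph: Graph,
    start: str,
    is_goal: Callable[[str], bool],
    score: Callable[[str], float],
    max_steps: int = 50,
) -> Optional[List[str]]:
    """Greedy walk that steps back whenever no unvisited neighbor remains.

    `score` is an assumed stand-in for a model's preference over next nodes.
    Returns the path from start to the goal, or None if no goal is reached.
    """
    path = [start]
    visited = {start}
    for _ in range(max_steps):
        current = path[-1]
        if is_goal(current):
            return path
        candidates = [n for n in graph.get(current, []) if n not in visited]
        if candidates:
            # Move to the most promising unvisited neighbor.
            best = max(candidates, key=score)
            visited.add(best)
            path.append(best)
        elif len(path) > 1:
            # Dead end: undo the last move and try another branch.
            path.pop()
        else:
            # Back at the start with nothing left to explore.
            return None
    return path if is_goal(path[-1]) else None


# Toy example: the greedy choice (B) leads to a dead end, so the walk
# backtracks to A and then reaches the goal E through C.
toy_graph: Graph = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}
toy_scores = {"B": 0.9, "C": 0.4, "D": 0.5, "E": 0.8}

result = navigate_with_backtracking(
    toy_graph, "A", lambda n: n == "E", lambda n: toy_scores[n]
)
print(result)  # → ['A', 'C', 'E']
```

In this sketch the step budget `max_steps` counts backtracking moves as well as forward moves, so early mistakes consume budget that later decisions cannot recover, which is one simple way to picture error accumulation over long horizons.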
Abstract
Vision-Language Models (VLMs) have made significant progress in explicit instruction-based navigation; however, their ability to interpret implicit human needs (e.g., "I am thirsty") in dynamic urban environments remains underexplored. This paper introduces CitySeeker, a novel benchmark designed to assess VLMs' spatial reasoning and decision-making capabilities for exploring embodied urban navigation to address implicit needs. CitySeeker includes 6,440 trajectories across 8 cities, capturing diverse visual characteristics and implicit needs in 7 goal-driven scenarios. Extensive experiments reveal that even top-performing models (e.g., Qwen2.5-VL-32B-Instruct) achieve only 21.1% task completion. We find key bottlenecks in error accumulation in long-horizon reasoning, inadequate spatial cognition, and deficient experiential recall. To further analyze them, we investigate a series of…
Peer Reviews
Decision: ICLR 2026 Poster
- Clear and Well-Structured: The paper is well-organized, with thorough explanations of the data collection process, benchmark design, and task formulation.
- Novel and Interesting Setting: The paper proposes the task of embodied urban navigation guided by implicit human needs. This task is not yet widely explored and has significant potential for real-world deployment.
- Extensive Evaluations: A wide range of VLMs are evaluated on the curated benchmarks, accompanied by comprehensive ana
I don't find significant weaknesses in this submission. However, I do have some concerns as follows. Therefore, I give a conservative score of borderline accept. I may consider increasing the rating if the authors adequately address these concerns.
- Open-Source Model Superiority: The paper observes that open-source VLMs (such as Qwen) occasionally outperform the proprietary VLMs. The submission would benefit from a deeper analysis of the underlying reasons behind this phenomenon.
- Presentati
The paper addresses an interesting interdisciplinary question. The dataset is constructed across multiple major cities and combines geospatial and textual information. The evaluation is systematic, covering both semantic matching and spatial reasoning tasks. The idea of connecting natural language intent to spatial decision-making is novel and potentially impactful.
The mapping from need to POI type assumes a fixed, deterministic relationship, which may not hold in practice: human intent is subjective and context-dependent. The model implicitly assumes people choose the shortest path or most direct POI option, which is unrealistic; behavioral factors like preference, familiarity, and accessibility play major roles. Cross-cultural generalization is a concern: the same “need” may imply different POIs across societies. The paper lacks an analysis of cultura
1. The motivation of this paper is well-grounded. Identifying that VLMs’ ability to interpret implicit human needs in dynamic urban environments remains underexplored is both timely and significant. It provides a new angle for examining VLMs’ world knowledge and decision-making capabilities.
2. The paper is clearly written and visually appealing. The figures effectively illustrate the framework and experiments, helping readers understand the design and reasoning process.
3. The authors conduct
1. Overall, this is a clear accept-level paper in terms of novelty, clarity, and experimental depth. However, there is a **serious ethics concern**. The paper states: *“CitySeeker dataset was sourced from publicly available APIs (Google Maps and Baidu Maps) and is used in accordance with their terms of service for non-commercial research purposes only.”* After reviewing Google Maps’ Terms of Service[1], it explicitly states: **“Downloading Street View images to use separately from Googl
Taxonomy
Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Constraint Satisfaction and Optimization
