WalkGPT: Grounded Vision-Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation

Rafi Ibn Sultan; Hui Zhu; Xiangyu Zhou; Chengyin Li; Prashant Khanduri; Marco Brocanelli; Dongxiao Zhu

arXiv:2603.10703·cs.CV·March 12, 2026

TL;DR

WalkGPT is a pixel-grounded vision-language model that gives pedestrians depth-aware navigation guidance by combining segmentation with language reasoning, with the aim of improving accessibility in complex urban scenes.

Contribution

The paper introduces WalkGPT, a pixel-grounded Large Vision-Language Model (LVLM) with depth reasoning capabilities, and PAVE, a large-scale benchmark for accessibility-aware pedestrian navigation.

Findings

01. WalkGPT achieves strong grounded reasoning performance.

02. The model effectively integrates segmentation and depth estimation.

03. The PAVE dataset enables comprehensive evaluation of navigation models.

Abstract

Ensuring accessible pedestrian navigation requires reasoning about both semantic and spatial aspects of complex urban scenes, a challenge that existing Large Vision-Language Models (LVLMs) struggle to meet. Although these models can describe visual content, their lack of explicit grounding leads to object hallucinations and unreliable depth reasoning, limiting their usefulness for accessibility guidance. We introduce WalkGPT, a pixel-grounded LVLM for the new task of Grounded Navigation Guide, unifying language reasoning and segmentation within a single architecture for depth-aware accessibility guidance. Given a pedestrian-view image and a navigation query, WalkGPT generates a conversational response with segmentation masks that delineate accessible and harmful features, along with relative depth estimation. The model incorporates a Multi-Scale Query Projector (MSQP) that shapes the…

Code & Models

Models

rafiibnsultan/walkgpt-13b (model, 14 downloads)

Datasets

rafiibnsultan/PAVE (dataset, 53 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Robotics and Sensor-Based Localization