CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation

Xia Su; Ruiqi Chen; Benlin Liu; Jingwei Ma; Zonglin Di; Ranjay Krishna; Jon Froehlich

arXiv:2602.18424 · cs.CV · February 23, 2026

TL;DR

CapNav introduces a benchmark to evaluate vision-language models' ability to perform indoor navigation tasks conditioned on different agent capabilities, revealing current models' limitations in handling mobility constraints and spatial reasoning.

Contribution

This work presents CapNav, a novel benchmark for capability-conditioned indoor navigation, including diverse agents, scenes, and tasks to assess VLMs' spatial reasoning under mobility constraints.

Findings

01 Current VLMs' navigation performance declines with stricter mobility constraints.

02 State-of-the-art models struggle with obstacle types requiring spatial reasoning.

03 CapNav provides a comprehensive platform for evaluating capability-aware embodied navigation.

Abstract

Vision-Language Models (VLMs) have shown remarkable progress in Vision-Language Navigation (VLN), offering new possibilities for navigation decision-making that could benefit both robotic platforms and human users. However, real-world navigation is inherently conditioned by the agent's mobility constraints. For example, a sweeping robot cannot traverse stairs, while a quadruped can. We introduce Capability-Conditioned Navigation (CapNav), a benchmark designed to evaluate how well VLMs can navigate complex indoor spaces given an agent's specific physical and operational capabilities. CapNav defines five representative human and robot agents, each described with physical dimensions, mobility capabilities, and environmental interaction abilities. CapNav provides 45 real-world indoor scenes, 473 navigation tasks, and 2365 QA pairs to test if VLMs can traverse indoor environments based on…
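
To make the idea of capability conditioning concrete, here is a minimal Python sketch of how an agent profile might gate whether a route is traversable. It is an illustration only, not the benchmark's data format or evaluation code: the class names, fields, and numeric values are all hypothetical, and only the stairs contrast between a sweeping robot and a quadruped comes from the abstract. The final lines note that the reported counts are consistent with one QA pair per task and agent combination (473 × 5 = 2365), which is an inference the truncated abstract does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentProfile:
    """Hypothetical capability profile; field names are assumptions."""
    name: str
    width_m: float          # assumed body width in metres
    can_climb_stairs: bool  # assumed mobility flag


@dataclass(frozen=True)
class PathSegment:
    """Hypothetical route segment with the features that may block an agent."""
    min_clear_width_m: float
    has_stairs: bool


def can_traverse(agent: AgentProfile, path: list[PathSegment]) -> bool:
    """Return True only if every segment is compatible with the agent."""
    for seg in path:
        if seg.has_stairs and not agent.can_climb_stairs:
            return False
        if seg.min_clear_width_m < agent.width_m:
            return False
    return True


# Illustrative values only; they are not taken from the paper.
sweeper = AgentProfile("sweeping_robot", width_m=0.35, can_climb_stairs=False)
quadruped = AgentProfile("quadruped", width_m=0.50, can_climb_stairs=True)
route = [PathSegment(0.9, has_stairs=False), PathSegment(1.0, has_stairs=True)]

sweeper_ok = can_traverse(sweeper, route)
quadruped_ok = can_traverse(quadruped, route)

# Consistency check on the reported counts (an inference, not a stated fact).
num_tasks, num_agents = 473, 5
qa_pairs_if_one_per_pair = num_tasks * num_agents
```

The same route yields different answers for different agents, which is the kind of agent-dependent judgment the benchmark asks a VLM to make from visual input rather than from explicit segment annotations like these.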

Code & Models

Datasets

RichardC0216/CapNav
dataset · 253 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Advanced Neural Network Applications