Exploring the Use of VLMs for Navigation Assistance for People with Blindness and Low Vision

Yu Li; Yuchen Zheng; Giles Hamilton-Fletcher; Marco Mezzavilla; Yao Wang; Sundeep Rangan; Maurizio Porfiri; Zhou Yu; John-Ross Rizzo

arXiv:2603.15624 · cs.CV · March 18, 2026

Open Access

TL;DR

This study evaluates vision-language models' capabilities in aiding navigation for people with blindness and low vision, highlighting their strengths and limitations in visual reasoning and scene understanding.

Contribution

The paper compares state-of-the-art closed-source and open-source VLMs on navigation assistance tasks, identifies where their performance falls short, and uses those gaps to guide future improvements for integrating VLMs into assistive technology.

Findings

01. GPT-4o outperforms other models in spatial reasoning and scene understanding.

02. Open-source models struggle with nuanced reasoning and complex environments.

03. Common challenges include object counting in cluttered scenes and spatial bias.

Abstract

This paper investigates the potential of vision-language models (VLMs) to assist people with blindness and low vision (pBLV) in navigation tasks. We evaluate state-of-the-art closed-source models, including GPT-4V, GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet, alongside open-source models, such as Llava-v1.6-mistral and Llava-onevision-qwen, to analyze their capabilities in foundational visual skills: counting ambient obstacles, relative spatial reasoning, and common-sense wayfinding-pertinent scene understanding. We further assess their performance in navigation scenarios, using pBLV-specific prompts designed to simulate real-world assistance tasks. Our findings reveal notable performance disparities between these models: GPT-4o consistently outperforms others across all tasks, particularly in spatial reasoning and scene understanding. In contrast, open-source models struggle with…

Taxonomy

Topics: Tactile and Sensory Interactions · Multimodal Machine Learning Applications · Spatial Cognition and Navigation