Exploring the Use of VLMs for Navigation Assistance for People with Blindness and Low Vision
Yu Li, Yuchen Zheng, Giles Hamilton-Fletcher, Marco Mezzavilla, Yao Wang, Sundeep Rangan, Maurizio Porfiri, Zhou Yu, John-Ross Rizzo

TL;DR
This study evaluates how well vision-language models (VLMs) can aid navigation for people with blindness and low vision, highlighting their strengths and limitations in visual reasoning and scene understanding.
Contribution
It provides a comprehensive comparison of state-of-the-art VLMs for navigation assistance, identifying performance gaps and guiding future improvements for assistive technology integration.
Findings
GPT-4o outperforms other models in spatial reasoning and scene understanding.
Open-source models struggle with nuanced reasoning and complex environments.
Common challenges include object counting in cluttered scenes and spatial bias.
Abstract
This paper investigates the potential of vision-language models (VLMs) to assist people with blindness and low vision (pBLV) in navigation tasks. We evaluate state-of-the-art closed-source models, including GPT-4V, GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet, alongside open-source models, such as Llava-v1.6-mistral and Llava-onevision-qwen, to analyze their capabilities in foundational visual skills: counting ambient obstacles, relative spatial reasoning, and common-sense wayfinding-pertinent scene understanding. We further assess their performance in navigation scenarios, using pBLV-specific prompts designed to simulate real-world assistance tasks. Our findings reveal notable performance disparities between these models: GPT-4o consistently outperforms others across all tasks, particularly in spatial reasoning and scene understanding. In contrast, open-source models struggle with…
Taxonomy
Topics: Tactile and Sensory Interactions · Multimodal Machine Learning Applications · Spatial Cognition and Navigation
