A Light and Smart Wearable Platform with Multimodal Foundation Model for Enhanced Spatial Reasoning in People with Blindness and Low Vision

Alexey Magay; Dhurba Tripathi; Yu Hao; Yi Fang

arXiv:2505.10875·cs.CV·May 19, 2025

Open Access

TL;DR

This paper introduces a lightweight, wearable assistive device enhanced with a multimodal foundation model that significantly improves spatial reasoning and environmental understanding for people with blindness and low vision.

Contribution

It presents a novel spatially-aware multimodal large language model integrated into a wearable device, improving navigation and object recognition for visually impaired users.

Findings

01 Enhanced environmental understanding and navigation accuracy

02 Significant improvements in object recognition performance

03 Positive user feedback on device usability and effectiveness

Abstract

People with blindness and low vision (pBLV) face significant challenges in navigating environments and locating objects because of limited visual cues. Spatial reasoning is crucial for these individuals: it enables them to understand and interpret the spatial relationships in their surroundings, so they can navigate and interact more safely and independently. Current multimodal large language models (MLLMs) for people with low vision lack the spatial reasoning capabilities needed to assist effectively in these tasks. Moreover, there is a notable absence of lightweight, easy-to-use systems that allow pBLV to perceive and interact with their surrounding environment. In this paper, we propose a novel, spatially enhanced MLLM-based approach for visually impaired individuals. By fine-tuning the MLLM to incorporate spatial reasoning…

Taxonomy

Topics: Tactile and Sensory Interactions · Multimodal Machine Learning Applications · Speech and dialogue systems

Methods: Attentive Walk-Aggregating Graph Neural Network