Towards Blind and Low-Vision Accessibility of Lightweight VLMs and Custom LLM-Evals
Shruti Singh Baghel, Yash Pratap Singh Rathore, Sushovan Jena, Anurag Pradhan, Amit Shukla, Arnav Bhavsar, Pawan Goyal

TL;DR
This paper evaluates lightweight vision-language models for blind and low-vision accessibility, introduces new assessment frameworks, and tests model performance on mobile devices to improve practical usability.
Contribution
It presents novel evaluation frameworks tailored for BLV accessibility and systematically assesses model size, prompt strategies, and deployment on mobile hardware.
Findings
SmolVLM2 models perform well on accessibility tasks
New frameworks effectively evaluate spatial and mobility information
Models can be deployed on smartphones with optimized precision
Abstract
Large Vision-Language Models (VLMs) excel at understanding and generating video descriptions, but their high memory, computation, and deployment demands hinder practical use, particularly for blind and low-vision (BLV) users who depend on detailed, context-aware descriptions. To study the effect of model size on accessibility-focused description quality, we evaluate SmolVLM2 variants with 500M and 2.2B parameters across two diverse datasets: AVCaps (outdoor) and Charades (indoor). We introduce two novel evaluation frameworks specifically designed for BLV accessibility assessment: the Multi-Context BLV Framework, which evaluates spatial orientation, social interaction, action events, and ambience contexts; and the Navigational Assistance Framework, which focuses on mobility-critical information. Additionally, we conduct a systematic evaluation of four different prompt design strategies…
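To make the memory constraint concrete, the following is a rough back-of-the-envelope sketch of weight storage for the two parameter counts named in the abstract. The precisions listed (fp32, fp16, int8, int4) are illustrative assumptions; the text shown here does not say which precisions the authors used on device. The estimate covers weights only and ignores activations, caches, and runtime overhead, so real footprints will be higher.

```python
# Rough weight-only memory estimate: parameters x bytes per parameter.
# The precision choices below are assumptions for illustration, not the
# paper's reported deployment settings.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

# Parameter counts as stated in the abstract; the labels are illustrative.
MODELS = {"SmolVLM2 (500M)": 500e6, "SmolVLM2 (2.2B)": 2.2e9}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for name, num_params in MODELS.items():
        for precision in BYTES_PER_PARAM:
            size = weight_memory_gb(num_params, precision)
            print(f"{name} @ {precision}: {size:.2f} GB")
```

Under these assumptions, halving the bytes per parameter halves the weight footprint, which is why reduced precision matters most for the larger variant on memory-limited phones.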
Taxonomy
Topics: Tactile and Sensory Interactions · Multimodal Machine Learning Applications · Subtitles and Audiovisual Media
