How Blind and Low-Vision Individuals Prefer Large Vision-Language Model-Generated Scene Descriptions
Na Min An, Eunki Kim, Wan Ju Kang, Sangryul Kim, James Thorne, Hyunjung Shim

TL;DR
This study evaluates how blind and low-vision individuals prefer scene descriptions generated by large vision-language models, highlighting the variability in preferences and the need for tailored evaluation metrics.
Contribution
The paper presents a user study on BLV preferences for LVLM descriptions and introduces a new automatic evaluation metric based on these insights.
Findings
User ratings varied widely in sufficiency and conciseness.
GPT-4o was not consistently preferred by participants, despite its strong potential to refine descriptions.
Insights led to a new evaluation metric capturing BLV preferences.
Abstract
For individuals with blindness or low vision (BLV), navigating complex environments can pose serious risks. Large Vision-Language Models (LVLMs) show promise for generating scene descriptions, but their effectiveness for BLV users remains underexplored. To address this gap, we conducted a user study with eight BLV participants to systematically evaluate preferences for six types of LVLM descriptions. While the descriptions helped to reduce fear and improve actionability, user ratings showed wide variation in sufficiency and conciseness. Furthermore, GPT-4o, despite its strong potential to refine descriptions, was not consistently preferred by participants. We use the insights obtained from the user study to build training data for a new automatic evaluation metric that can capture BLV preferences effectively. Our findings underscore the urgent need for BLV-centered evaluation metrics and…
