Explaining CLIP's performance disparities on data from blind/low vision users
Daniela Massiceti, Camilla Longden, Agnieszka Słowik, Samuel Wills, Martin Grayson, Cecily Morrison

TL;DR
This paper evaluates CLIP's performance on images from blind/low vision users, revealing significant disparities due to content, quality, and text sensitivity, and explores mitigation strategies.
Contribution
It systematically assesses CLIP's performance disparities on BLV user data and analyzes dataset biases, proposing few-shot learning as a mitigation approach.
Findings
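Across 25 CLIP variants, zero-shot classification accuracy is on average 15 percentage points lower on images captured by BLV users than on web-crawled images.
The disparity stems from CLIP's sensitivity to image content (e.g. disability objects), image quality (e.g. lighting variation), and text content (e.g. tactile versus visual adjectives).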
Few-shot learning with 5 images can reduce performance gaps.
Abstract
Large multi-modal models (LMMs) hold the potential to usher in a new era of automated visual assistance for people who are blind or low vision (BLV). Yet, these models have not been systematically evaluated on data captured by BLV users. We address this by empirically assessing CLIP, a widely-used LMM likely to underpin many assistive technologies. Testing 25 CLIP variants in a zero-shot classification task, we find that their accuracy is 15 percentage points lower on average for images captured by BLV users than web-crawled images. This disparity stems from CLIP's sensitivities to 1) image content (e.g. not recognizing disability objects as well as other objects); 2) image quality (e.g. not being robust to lighting variation); and 3) text content (e.g. not recognizing objects described by tactile adjectives as well as visual ones). We delve deeper with a textual analysis of three…
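To make the zero-shot evaluation described above concrete, the following is a minimal sketch, not the authors' code. It assumes image and text embeddings have already been produced by some CLIP-style encoder; the arrays, labels, and function names below are hypothetical toy values chosen for illustration and do not reproduce any result from the paper. Each image is assigned the class whose text-prompt embedding has the highest cosine similarity, and the gap between two image sources is reported in percentage points.

```python
import numpy as np


def zero_shot_predict(image_emb, text_emb):
    """Return, for each image, the index of the most similar class prompt."""
    # L2-normalize rows so the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return (image_emb @ text_emb.T).argmax(axis=1)


def accuracy(pred, labels):
    """Fraction of predictions that match the ground-truth labels."""
    return float((pred == labels).mean())


def gap_in_points(acc_a, acc_b):
    """Difference between two accuracies, expressed in percentage points."""
    return 100.0 * (acc_a - acc_b)


# Hypothetical toy embeddings: three class prompts, four images per source.
text_emb = np.eye(3)
web_images = np.array([[0.9, 0.1, 0.0],
                       [0.1, 0.8, 0.1],
                       [0.0, 0.2, 0.9],
                       [0.7, 0.2, 0.1]])
web_labels = np.array([0, 1, 2, 0])
blv_images = np.array([[0.9, 0.1, 0.0],
                       [0.5, 0.4, 0.1],
                       [0.3, 0.2, 0.6],
                       [0.2, 0.5, 0.3]])
blv_labels = np.array([0, 1, 2, 0])

web_acc = accuracy(zero_shot_predict(web_images, text_emb), web_labels)
blv_acc = accuracy(zero_shot_predict(blv_images, text_emb), blv_labels)
gap = gap_in_points(web_acc, blv_acc)
print(f"web: {web_acc:.2f}  BLV: {blv_acc:.2f}  gap: {gap:.1f} points")
```

Reporting the gap in percentage points rather than as a relative change keeps it comparable across model variants with different baseline accuracies.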
Taxonomy
Topics: Digital Accessibility for Disabilities · Domain Adaptation and Few-Shot Learning · Tactile and Sensory Interactions
Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training
