"I Can See Forever!": Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments

Ziyi Zhang; Zhen Sun; Zongmin Zhang; Zifan Peng; Yuemeng Zhao; Zichun Wang; Zeren Luo; Ruiting Zuo; Xinlei He

arXiv:2505.04488·cs.CV·December 5, 2025


TL;DR

This paper evaluates real-time VideoLLMs for assisting visually impaired individuals. It introduces a new benchmark (VisAssistDaily) and a new dataset (SafeVid), and shows that fine-tuning improves hazard perception.

Contribution

The paper presents the first comprehensive evaluation of VideoLLMs in real-world assistive scenarios for visually impaired individuals, and contributes a new benchmark, a new dataset, and fine-tuning for hazard perception.

Findings

01

GPT-4o achieves the highest task success rate among the evaluated VideoLLMs

02

Fine-tuning VITA-1.5 improves hazard recognition from 25% to 76%

03

User study highlights concerns about hazard perception

Abstract

The visually impaired population faces significant challenges in daily activities. While prior works employ vision language models for assistance, most focus on static content and cannot address real-time perception needs in complex environments. Recent VideoLLMs enable real-time vision and speech interaction, offering promising potential for assistive tasks. In this work, we conduct the first study evaluating their effectiveness in supporting daily life for visually impaired individuals. We first conduct a user survey with visually impaired participants to design the benchmark VisAssistDaily for daily life evaluation. Using VisAssistDaily, we evaluate popular VideoLLMs and find GPT-4o achieves the highest task success rate. We further conduct a user study to reveal concerns about hazard perception. To address this, we propose SafeVid, an environment-awareness dataset, and fine-tune…

Taxonomy

Topics: Tactile and Sensory Interactions · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

MethodsFocus