ViDscribe: Multimodal AI for Customizing Audio Description and Question Answering in Online Videos

Maryam Cheema, Sina Elahimanesh, Pooyan Fazli, and Hasti Seifi

arXiv:2603.14662·cs.HC·March 17, 2026

Open Access

TL;DR

ViDscribe is a web platform that uses multimodal AI to generate customizable audio descriptions and answer viewers' questions about online videos, improving accessibility for blind and low vision users through personalization and interaction.

Contribution

The paper introduces ViDscribe, a platform that combines AI-generated, customizable audio descriptions (ADs) with a conversational video question answering (VQA) interface, evaluated in a longitudinal, in-the-wild study with eight blind and low vision (BLV) participants.

Findings

01 Customized ADs improve user engagement and satisfaction.

02 Users prefer personalized descriptions over default ones.

03 Interactive features increase immersion for BLV viewers.

Abstract

Advances in multimodal large language models enable automatic video narration and question answering (VQA), offering scalable alternatives to labor-intensive, human-authored audio descriptions (ADs) for blind and low vision (BLV) viewers. However, prior AI-driven AD systems rarely adapt to the diverse needs and preferences of BLV individuals across videos and are typically evaluated in controlled, single-session settings. We present ViDscribe, a web-based platform that integrates AI-generated ADs with six types of user customizations and a conversational VQA interface for YouTube videos. Through a longitudinal, in-the-wild study with eight BLV participants, we examine how users engage with customization and VQA features over time. Our results show sustained engagement with both features and that customized ADs improve effectiveness, enjoyment, and immersion compared to default ADs,…

Taxonomy

Topics: Subtitles and Audiovisual Media · Multimodal Machine Learning Applications · Speech Recognition and Synthesis