NarrationBot and InfoBot: A Hybrid System for Automated Video Description
Shasta Ihorn, Yue-Ting Siu, Aditya Bodi, Lothar Narins, Jose M. Castanon, Yash Kant, Abhishek Das, Ilmi Yoon, Pooyan Fazli

TL;DR
NarrationBot and InfoBot are a hybrid system that automatically generates and enhances video descriptions, significantly improving accessibility for blind and low vision users and enabling more efficient video content engagement.
Contribution
The paper introduces a novel hybrid system combining automatic video description generation and interactive querying, enhancing accessibility and user experience for visually impaired viewers.
Findings
System significantly improved user comprehension and enjoyment when both tools were used in tandem
No significant difference between autogenerated and human-revised descriptions
High user enthusiasm for the system
Abstract
Video accessibility is crucial for blind and low vision users for equitable engagement in education, employment, and entertainment. Despite the availability of professional and amateur services and tools, most human-generated descriptions are expensive and time-consuming. Moreover, the rate of human-generated descriptions cannot match the speed of video production. To overcome the increasing gaps in video accessibility, we developed a hybrid system of two tools to 1) automatically generate descriptions for videos and 2) provide answers or additional descriptions in response to user queries on a video. Results from a mixed-methods study with 26 blind and low vision individuals show that our system significantly improved user comprehension and enjoyment of selected videos when both tools were used in tandem. In addition, participants reported no significant difference in their ability to…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
