CrossA11y: Identifying Video Accessibility Issues via Cross-modal Grounding
Xingyu "Bruce" Liu, Ruolin Wang, Dingzeyu Li, Xiang 'Anthony' Chen, Amy Pavel

TL;DR
CrossA11y is a system that automatically detects visual and auditory accessibility issues in videos using cross-modal analysis, making it easier for authors to add audio descriptions and closed captions where they are needed.
Contribution
This paper introduces CrossA11y, a novel system leveraging cross-modal grounding to automatically identify accessibility issues in videos, improving efficiency over manual methods.
Findings
Effective detection of accessibility issues demonstrated in a lab study
Participants found CrossA11y intuitive and helpful for editing videos
Compared to baseline, CrossA11y improved accessibility issue identification
Abstract
Authors make their videos visually accessible by adding audio descriptions (AD), and auditorily accessible by adding closed captions (CC). However, creating AD and CC is challenging and tedious, especially for non-professional describers and captioners, due to the difficulty of identifying accessibility problems in videos. A video author will have to watch the video through and manually check for inaccessible information frame-by-frame, for both visual and auditory modalities. In this paper, we present CrossA11y, a system that helps authors efficiently detect and address visual and auditory accessibility issues in videos. Using cross-modal grounding analysis, CrossA11y automatically measures accessibility of visual and audio segments in a video by checking for modality asymmetries. CrossA11y then displays these segments and surfaces visual and audio accessibility issues in a unified…
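The idea of checking for modality asymmetries can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say how grounding is scored, so the similarity matrix, the `find_asymmetries` helper, and the 0.5 threshold below are all hypothetical. The sketch assumes that each visual segment and each audio segment already has a grounding score against every segment of the other modality, and it flags segments that nothing in the other modality appears to cover.

```python
from typing import List, Tuple


def find_asymmetries(
    similarity: List[List[float]], threshold: float = 0.5
) -> Tuple[List[int], List[int]]:
    """Flag segments that are poorly grounded in the other modality.

    similarity[i][j] is an assumed grounding score in [0, 1] between
    visual segment i and audio segment j. The scoring model and the
    threshold are illustrative assumptions, not taken from the paper.
    """
    # A visual segment with no well-matching audio segment may need
    # an audio description.
    visual_issues = [
        i for i, row in enumerate(similarity)
        if max(row, default=0.0) < threshold
    ]

    # An audio segment with no well-matching visual segment may need
    # a caption.
    n_audio = len(similarity[0]) if similarity else 0
    audio_issues = [
        j for j in range(n_audio)
        if max((row[j] for row in similarity), default=0.0) < threshold
    ]
    return visual_issues, audio_issues


# Hypothetical scores for three visual and three audio segments.
scores = [
    [0.9, 0.1, 0.2],
    [0.2, 0.1, 0.3],
    [0.1, 0.1, 0.8],
]
visual_issues, audio_issues = find_asymmetries(scores)
```

With these made-up scores, the second visual segment and the second audio segment are flagged, because neither has a counterpart scoring at or above the threshold.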
