NowYouSee Me: Context-Aware Automatic Audio Description
Seon-Ho Lee, Jue Wang, David Fan, Zhikang Zhang, Linda Liu, Xiang Hao, Vimal Bhat, Xinyu Li

TL;DR
This paper introduces CA3D, a novel end-to-end system for automatic, context-aware audio description that accurately localizes visual events in cinematic content using only visual cues, enhancing accessibility for visually impaired audiences.
Contribution
The paper presents CA3D, the first unified system that detects and generates audio descriptions solely from visual information, improving automation and accuracy over previous metadata-dependent methods.
Findings
Achieves state-of-the-art performance in AD event detection
Improves script generation metrics significantly
Demonstrates effectiveness across diverse cinematic content
Abstract
Audio Description (AD) plays a pivotal role in guaranteeing accessibility in multimedia content: it provides additional narrations at suitable intervals to describe visual elements, catering specifically to the needs of visually impaired audiences. In this paper, we introduce CA3D, the pioneering unified Context-Aware Automatic Audio Description system that provides AD event scripts with precise locations in long cinematic content. Specifically, the system consists of: 1) a Temporal Feature Enhancement Module to efficiently capture longer-term dependencies, 2) an anchor-based AD event detector with a feature suppression module that localizes the AD events and extracts discriminative features for AD generation, and 3) a self-refinement module that leverages the generated output to tweak AD event boundaries from coarse to fine.…
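The abstract does not say how the anchor-based detector lays out or scores its anchors, so the following is only a generic sketch of anchor-based temporal localization, not the paper's method. It assumes fixed-stride anchor centers with a few candidate widths and uses temporal IoU to pick the anchor closest to a target event. All names (`temporal_iou`, `make_anchors`, `best_anchor`) and all parameter values are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); an assumed representation


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two 1-D time segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def make_anchors(duration: float, stride: float, widths: List[float]) -> List[Segment]:
    """Place candidate segments of each width at evenly spaced centers.

    Anchors are clipped to [0, duration]. The stride and widths are
    illustrative choices, not values taken from the paper.
    """
    anchors: List[Segment] = []
    n_centers = int(duration // stride)
    for i in range(n_centers):
        center = (i + 0.5) * stride
        for w in widths:
            start = max(0.0, center - w / 2)
            end = min(duration, center + w / 2)
            anchors.append((start, end))
    return anchors


def best_anchor(anchors: List[Segment], event: Segment) -> Segment:
    """Return the anchor with the highest temporal IoU against the event."""
    return max(anchors, key=lambda a: temporal_iou(a, event))


# Hypothetical 10-second clip with anchors of width 2 s and 4 s.
anchors = make_anchors(duration=10.0, stride=2.0, widths=[2.0, 4.0])
matched = best_anchor(anchors, event=(3.0, 7.0))
```

In a learned detector, a matching step like `best_anchor` would typically be used to assign training targets, with the network then regressing boundary offsets per anchor. A coarse-to-fine refinement stage, such as the one the abstract mentions, would adjust those boundaries further.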
