NowYouSee Me: Context-Aware Automatic Audio Description

Seon-Ho Lee; Jue Wang; David Fan; Zhikang Zhang; Linda Liu; Xiang Hao; Vimal Bhat; Xinyu Li

arXiv:2412.10002·cs.CV·December 16, 2024

TL;DR

This paper introduces CA3D, a novel end-to-end system for automatic, context-aware audio description that accurately localizes visual events in cinematic content using only visual cues, enhancing accessibility for visually impaired audiences.

Contribution

The paper presents CA3D, the first unified system that both detects AD events and generates their audio descriptions solely from visual information, improving automation and accuracy over previous metadata-dependent methods.

Findings

01 Achieves state-of-the-art performance in AD event detection

02 Improves script generation metrics significantly

03 Demonstrates effectiveness across diverse cinematic content

Abstract

Audio Description (AD) plays a pivotal role in making multimedia content accessible: it provides additional narration at suitable intervals to describe visual elements, catering specifically to the needs of visually impaired audiences. In this paper, we introduce $CA^{3}D$, the pioneering unified Context-Aware Automatic Audio Description system that provides AD event scripts with precise locations in long cinematic content. Specifically, the $CA^{3}D$ system consists of: 1) a Temporal Feature Enhancement Module to efficiently capture longer-term dependencies, 2) an anchor-based AD event detector with a feature suppression module that localizes the AD events and extracts discriminative features for AD generation, and 3) a self-refinement module that leverages the generated output to adjust AD event boundaries from coarse to fine.…
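To make the idea of anchor-based temporal event detection more concrete, the sketch below tiles candidate segments over a video timeline and keeps the highest-scoring non-overlapping ones. It is a generic illustration and not the paper's implementation: the function names (`make_anchors`, `temporal_iou`, `nms`), the anchor lengths, the stride, and the 0.5 overlap threshold are all assumptions made for this example, and the scores would in practice come from a learned detector.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def make_anchors(duration: float, stride: float, lengths: List[float]) -> List[Segment]:
    """Tile candidate segments of several lengths over a video timeline."""
    anchors: List[Segment] = []
    t = 0.0
    while t < duration:
        for length in lengths:
            # Clip each anchor so it never runs past the end of the video.
            anchors.append((t, min(t + length, duration)))
        t += stride
    return anchors


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two 1-D time segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def nms(segments: List[Segment], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy non-maximum suppression; returns indices of kept segments."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a segment only if it does not overlap too much with a better one.
        if all(temporal_iou(segments[i], segments[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    candidates = [(0.0, 4.0), (1.0, 5.0), (10.0, 12.0)]
    scores = [0.9, 0.8, 0.7]  # placeholder detector confidences
    print(nms(candidates, scores))
```

A coarse-to-fine refinement step, as the abstract describes, would then adjust the boundaries of the kept segments; that part depends on the generation model and is not sketched here.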
