AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description
Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman

TL;DR
This paper introduces a model for automatic movie audio description that addresses the 'who', 'when', and 'what' questions by combining character identification, temporal decision models, and a new vision-language architecture, with the goal of improving description quality for visually impaired audiences.
Contribution
It presents a novel integrated approach combining character banks, temporal models, and a new vision-language architecture for improved automatic movie audio description.
Findings
Enhanced character naming accuracy in AD
Effective temporal interval selection for AD generation
Improved AD quality over previous models
Abstract
Audio Description (AD) is the task of generating descriptions of visual content, at suitable time intervals, for the benefit of visually impaired audiences. For movies, this presents notable challenges -- AD must occur only during existing pauses in dialogue, should refer to characters by name, and ought to aid understanding of the storyline as a whole. To this end, we develop a new model for automatically generating movie AD, given CLIP visual features of the frames, the cast list, and the temporal locations of the speech; addressing all three of the 'who', 'when', and 'what' questions: (i) who -- we introduce a character bank consisting of the character's name, the actor that played the part, and a CLIP feature of their face, for the principal cast of each movie, and demonstrate how this can be used to improve naming in the generated AD; (ii) when -- we investigate several models for…
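The abstract states that AD must occur only during existing pauses in dialogue and that the model is given the temporal locations of the speech. As a minimal sketch of that constraint only (not the paper's 'when' models, which the truncated abstract does not describe), the following finds candidate dialogue-free intervals from a list of speech segments. The function name `dialogue_gaps`, the `movie_end` argument, and the `min_gap` threshold are hypothetical choices made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def dialogue_gaps(
    speech: List[Interval],
    movie_end: float,
    min_gap: float = 1.0,  # hypothetical threshold, not taken from the paper
) -> List[Interval]:
    """Return intervals of at least `min_gap` seconds with no speech."""
    gaps: List[Interval] = []
    cursor = 0.0  # end of the speech seen so far
    for start, end in sorted(speech):
        # A gap exists if the next speech segment starts late enough.
        if start - cursor >= min_gap:
            gaps.append((cursor, start))
        # Overlapping segments must not move the cursor backwards.
        cursor = max(cursor, end)
    # Trailing silence after the last speech segment.
    if movie_end - cursor >= min_gap:
        gaps.append((cursor, movie_end))
    return gaps


speech_segments = [(2.0, 5.0), (4.0, 6.0), (6.5, 9.0)]
print(dialogue_gaps(speech_segments, movie_end=12.0))
```

In this sketch, overlapping speech segments are merged implicitly through the running cursor, and pauses shorter than `min_gap` are discarded as too brief to hold a description. Whether a given pause actually receives AD is a separate decision that this sketch does not attempt.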
Videos
AutoAD II: The Sequel – Who, When, and What in Movie Audio Description · YouTube
Taxonomy
Topics: Subtitles and Audiovisual Media · Infrastructure Maintenance and Monitoring
Methods: Contrastive Language-Image Pre-training
