More than a Moment: Towards Coherent Sequences of Audio Descriptions

Eshika Khandelwal; Junyu Xie; Tengda Han; Max Bain; Arsha Nagrani; Andrew Zisserman; Gül Varol; Makarand Tapaswi

arXiv:2510.25440 · cs.CV · October 30, 2025

TL;DR

This paper introduces CoherentAD, a training-free method for generating coherent audio description sequences that improve narrative flow and reduce redundancy, evaluated with a new sequence-level metric called StoryRecall.

Contribution

The paper presents a novel training-free approach for generating coherent audio description sequences and introduces the StoryRecall metric for holistic evaluation.

Findings

01

CoherentAD outperforms prior independent generation methods.

02

The sequence-level metric StoryRecall effectively measures narrative coherence.

03

Experimental results show that the generated AD sequences offer enhanced narrative understanding.

Abstract

Audio Descriptions (ADs) convey essential on-screen information, allowing visually impaired audiences to follow videos. To be effective, ADs must form a coherent sequence that helps listeners to visualise the unfolding scene, rather than describing isolated moments. However, most automatic methods generate each AD independently, often resulting in repetitive, incoherent descriptions. To address this, we propose a training-free method, CoherentAD, that first generates multiple candidate descriptions for each AD time interval, and then performs auto-regressive selection across the sequence to form a coherent and informative narrative. To evaluate AD sequences holistically, we introduce a sequence-level metric, StoryRecall, which measures how well the predicted ADs convey the ground truth narrative, alongside repetition metrics that capture the redundancy across consecutive AD outputs. Our…
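The two-stage idea in the abstract (several candidates per AD interval, then auto-regressive selection conditioned on what has already been chosen) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names are hypothetical, and the word-overlap scorer is a simple stand-in, since the excerpt does not say how CoherentAD scores or selects candidates. The `consecutive_repetition` helper is likewise only a toy proxy for the repetition metrics mentioned above, not the paper's definition.

```python
from typing import Callable, List, Sequence


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard overlap (0 = disjoint, 1 = identical word sets)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def novelty_score(candidate: str, history: Sequence[str]) -> float:
    """Stand-in scorer (assumption): penalise overlap with the most
    similar AD already selected. Higher is better."""
    if not history:
        return 0.0
    return -max(jaccard(candidate, prev) for prev in history)


def select_sequence(
    candidates_per_interval: Sequence[Sequence[str]],
    score: Callable[[str, Sequence[str]], float] = novelty_score,
) -> List[str]:
    """Greedy auto-regressive selection: for each interval in order, pick
    the candidate that scores best given the ADs chosen so far."""
    selected: List[str] = []
    for candidates in candidates_per_interval:
        if not candidates:
            raise ValueError("each interval needs at least one candidate")
        best = max(candidates, key=lambda c: score(c, selected))
        selected.append(best)
    return selected


def consecutive_repetition(ads: Sequence[str]) -> float:
    """Toy redundancy proxy (assumption): mean word overlap between
    each AD and the one that follows it."""
    if len(ads) < 2:
        return 0.0
    overlaps = [jaccard(a, b) for a, b in zip(ads, ads[1:])]
    return sum(overlaps) / len(overlaps)


if __name__ == "__main__":
    candidates = [
        ["a man walks into a room"],
        ["a man walks into a room", "he picks up a letter"],
        ["he picks up a letter", "she watches from the door"],
    ]
    sequence = select_sequence(candidates)
    print(sequence)
    print(consecutive_repetition(sequence))
```

Because each choice depends on the previously selected ADs, a candidate that merely repeats an earlier description loses to one that adds new information. Swapping `novelty_score` for a stronger scorer (for example, a language-model judgement of coherence) would leave the selection loop unchanged.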