AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description

Junyu Xie; Tengda Han; Max Bain; Arsha Nagrani; Gül Varol; Weidi Xie; Andrew Zisserman

arXiv:2407.15850·cs.CV·November 25, 2024

TL;DR

AutoAD-Zero introduces a training-free method leveraging off-the-shelf visual-language and language models to generate audio descriptions for movies and TV series, achieving state-of-the-art results without fine-tuning.

Contribution

It develops a novel two-stage prompting approach and creates a new TV audio description dataset, enabling effective zero-shot AD generation.

Findings

01. Achieves state-of-the-art CRITIC scores in AD generation

02. Demonstrates successful character referencing without fine-tuning

03. Outperforms some fine-tuned models in a zero-shot setting

Abstract

Our objective is to generate Audio Descriptions (ADs) for both movies and TV series in a training-free manner. We use the power of off-the-shelf Visual-Language Models (VLMs) and Large Language Models (LLMs), and develop visual and text prompting strategies for this task. Our contributions are three-fold: (i) We demonstrate that a VLM can successfully name and refer to characters if directly prompted with character information through visual indications without requiring any fine-tuning; (ii) A two-stage process is developed to generate ADs, with the first stage asking the VLM to comprehensively describe the video, followed by a second stage utilising an LLM to summarise dense textual information into one succinct AD sentence; (iii) A new dataset for TV audio description is formulated. Our approach, named AutoAD-Zero, demonstrates outstanding performance (even competitive with some…
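
The two-stage process in contribution (ii) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `VLM` and `LLM` callable interfaces, the function names, and the prompt wording are all assumptions made for this example, and the character-name hint in the prompt only stands in for the visual indications the abstract mentions.

```python
from typing import Callable, Sequence

# Hypothetical interfaces (assumptions): any model wrapper with these shapes would work.
VLM = Callable[[Sequence[object], str], str]  # (frames, prompt) -> dense description
LLM = Callable[[str], str]                    # prompt -> completion


def build_stage1_prompt(character_names: Sequence[str]) -> str:
    """Stage 1 prompt: ask for a comprehensive description of the clip."""
    if character_names:
        cast = "Characters marked in the frames: " + ", ".join(character_names) + ". "
    else:
        cast = ""
    return cast + "Describe comprehensively what happens in this clip, referring to characters by name."


def build_stage2_prompt(dense_description: str) -> str:
    """Stage 2 prompt: ask for a single succinct AD sentence."""
    return (
        "Summarise the following description into one succinct audio description sentence.\n\n"
        + dense_description
    )


def generate_ad(frames: Sequence[object], character_names: Sequence[str], vlm: VLM, llm: LLM) -> str:
    """Run both stages: VLM dense description, then LLM summary into one AD sentence."""
    dense = vlm(frames, build_stage1_prompt(character_names))
    return llm(build_stage2_prompt(dense)).strip()


if __name__ == "__main__":
    # Stub models so the sketch runs without any model weights.
    def stub_vlm(frames, prompt):
        return "Alice walks to the door, pauses, then opens it slowly."

    def stub_llm(prompt):
        return "Alice opens the door."

    print(generate_ad(["frame0", "frame1"], ["Alice"], stub_vlm, stub_llm))
```

Passing the models in as plain callables keeps the sketch training-free in spirit: either stage can be swapped for a different off-the-shelf model without touching the pipeline logic.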

Code & Models

Repositories

Jyxarthur/AutoAD-Zero
PyTorch · Official

Taxonomy

Topics: Subtitles and Audiovisual Media · Natural Language Processing Techniques · Music and Audio Processing