LLM-AD: Large Language Model based Audio Description System

Peng Chu; Jiang Wang; Andre Abrantes

arXiv:2405.00983 · cs.CV · May 3, 2024 · 1 citation

TL;DR

This paper presents an automated audio description generation system using GPT-4V(ision), which produces high-quality, contextually consistent descriptions without additional training, enhancing video accessibility.

Contribution

The proposed pipeline leverages GPT-4V(ision) and a tracking module to generate standards-compliant ADs without extra training, maintaining character consistency across frames.

Findings

01. Achieves a CIDEr score of 20.5 on the MAD dataset

02. Performs comparably to learning-based methods

03. Uses readily available components without additional training

Abstract

The development of Audio Description (AD) has been a pivotal step forward in making video content more accessible and inclusive. Traditionally, AD production has demanded a considerable amount of skilled labor, while existing automated approaches still necessitate extensive training to integrate multimodal inputs and tailor the output from a captioning style to an AD style. In this paper, we introduce an automated AD generation pipeline that harnesses the potent multimodal and instruction-following capacities of GPT-4V(ision). Notably, our methodology employs readily available components, eliminating the need for additional training. It produces ADs that not only comply with established natural language AD production standards but also maintain contextually consistent character information across frames, courtesy of a tracking-based character recognition module. A thorough analysis on…
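The idea of keeping character information consistent across frames through tracking can be illustrated with a minimal sketch. It is not the authors' implementation: the `Detection` record, the majority-vote rule, the "unknown person" fallback, and the prompt wording are all assumptions made for illustration. The sketch assumes a tracker has already assigned a `track_id` to each detected person and that a per-frame recognizer names a character only in some frames; a name is then propagated to every frame of the same track before a text prompt is assembled for a multimodal model.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Detection:
    """Hypothetical record: one person detected in one frame."""
    frame: int
    track_id: int          # identity assigned by an (assumed) tracker
    name: Optional[str]    # name from an (assumed) recognizer, None if unrecognized


def propagate_names(detections):
    """Give every detection in a track the most frequent name seen on that track."""
    votes = defaultdict(Counter)
    for d in detections:
        if d.name is not None:
            votes[d.track_id][d.name] += 1

    # Majority vote per track; tracks never recognized get no entry.
    track_name = {t: c.most_common(1)[0][0] for t, c in votes.items()}

    per_frame = defaultdict(list)
    for d in detections:
        per_frame[d.frame].append(track_name.get(d.track_id, "unknown person"))
    return dict(per_frame)


def build_prompt(per_frame):
    """Assemble an illustrative text prompt listing the characters per frame."""
    lines = ["Describe the clip in audio description style, using these character names:"]
    for frame in sorted(per_frame):
        lines.append(f"Frame {frame}: " + ", ".join(per_frame[frame]))
    return "\n".join(lines)


detections = [
    Detection(frame=0, track_id=1, name="Alice"),
    Detection(frame=1, track_id=1, name=None),   # recognizer missed; track fills it in
    Detection(frame=1, track_id=2, name=None),   # never recognized on any frame
    Detection(frame=2, track_id=1, name="Alice"),
]
per_frame = propagate_names(detections)
prompt = build_prompt(per_frame)
print(prompt)
```

In this toy input, the unrecognized detection in frame 1 inherits the name from its track, so the same character is referred to by one name in every frame that reaches the prompt.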

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis