LLM-AD: Large Language Model based Audio Description System
Peng Chu, Jiang Wang, Andre Abrantes

TL;DR
This paper presents an automated audio description (AD) generation system built on GPT-4V(ision). The system produces high-quality, contextually consistent descriptions without additional training, improving video accessibility.
Contribution
The proposed pipeline leverages GPT-4V(ision) and a tracking module to generate standards-compliant ADs without extra training, maintaining character consistency across frames.
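To make the idea of tracking-based character consistency concrete, here is a minimal sketch, not the authors' implementation. It assumes a tracker has already assigned a stable track ID to each person, and that a recognizer has named some detections but not others. The function names (propagate_names, build_prompt), the input layout, and the prompt wording are all hypothetical. No model is called; the sketch only shows how a name recognized in one frame could be carried to every frame that shares the same track before a text prompt is assembled.

```python
from collections import defaultdict


def propagate_names(detections):
    """Carry a recognized name to every frame that shares its track ID.

    detections: list of (frame_index, track_id, name_or_None) tuples.
    This layout is an assumption made for illustration.
    Returns {frame_index: sorted list of character labels}.
    """
    # First pass: remember the first recognized name seen for each track.
    track_name = {}
    for _, track_id, name in detections:
        if name is not None and track_id not in track_name:
            track_name[track_id] = name

    # Second pass: label every detection with its track's name,
    # falling back to a generic label for tracks never recognized.
    per_frame = defaultdict(set)
    for frame_index, track_id, _ in detections:
        label = track_name.get(track_id, f"person {track_id}")
        per_frame[frame_index].add(label)
    return {f: sorted(labels) for f, labels in sorted(per_frame.items())}


def build_prompt(frame_names, previous_ads):
    """Assemble a plain-text prompt (hypothetical wording) from the labels."""
    lines = ["Describe the clip in one concise audio description sentence."]
    if previous_ads:
        lines.append("Previous descriptions: " + " ".join(previous_ads))
    for frame_index, names in frame_names.items():
        lines.append(f"Frame {frame_index}: visible characters: {', '.join(names)}")
    return "\n".join(lines)


# Track 1 is only recognized in frame 1, yet is labeled in frames 0 and 2 too.
detections = [(0, 1, None), (0, 2, "Anna"), (1, 1, "Ben"), (2, 1, None), (2, 3, None)]
frame_names = propagate_names(detections)
prompt = build_prompt(frame_names, ["Anna enters the kitchen."])
print(prompt)
```

In this sketch the name is keyed to the track rather than to a single frame, so a character who is recognizable in only one frame keeps the same label throughout the clip.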
Findings
Achieves a CIDEr score of 20.5 on the MAD dataset
Performs comparably to learning-based methods
Uses readily available components, with no additional training
Abstract
The development of Audio Description (AD) has been a pivotal step forward in making video content more accessible and inclusive. Traditionally, AD production has demanded a considerable amount of skilled labor, while existing automated approaches still necessitate extensive training to integrate multimodal inputs and tailor the output from a captioning style to an AD style. In this paper, we introduce an automated AD generation pipeline that harnesses the potent multimodal and instruction-following capacities of GPT-4V(ision). Notably, our methodology employs readily available components, eliminating the need for additional training. It produces ADs that not only comply with established natural language AD production standards but also maintain contextually consistent character information across frames, courtesy of a tracking-based character recognition module. A thorough analysis on…
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis
