MCAD: Multimodal Context-Aware Audio Description Generation For Soccer
Lipisha Chaudhary, Trisha Mittal, Subhadra Gopalakrishnan, Ifeoma Nwogu, Jaclyn Pytlarz

TL;DR
This paper introduces MCAD, an end-to-end system that generates audio descriptions for soccer videos by leveraging multimodal context and fine-tuned large language models, extending AD beyond movies to sports.
Contribution
The work presents a novel pipeline for soccer AD generation that does not rely on ground truth AD, along with a new evaluation metric (ARGE-AD) and a dataset of annotated soccer clips.
Findings
MCAD effectively generates context-aware AD for soccer videos.
The ARGE-AD metric accurately assesses AD quality across domains.
The approach outperforms baseline methods in descriptive accuracy.
Abstract
Audio Descriptions (AD) are essential for making visual content accessible to individuals with visual impairments. Recent works have shown a promising step towards automating AD, but they have been limited to describing high-quality movie content using human-annotated ground truth AD in the process. In this work, we present an end-to-end pipeline, MCAD, that extends AD generation beyond movies to the domain of sports, with a focus on soccer games, without relying on ground truth AD. To address the absence of domain-specific AD datasets, we fine-tune a Video Large Language Model on publicly available movie AD datasets so that it learns the narrative structure and conventions of AD. During inference, MCAD incorporates multimodal contextual cues such as player identities, soccer events and actions, and commentary from the game. These cues, combined with input prompts to the fine-tuned…
Taxonomy
Topics: Subtitles and Audiovisual Media · Video Analysis and Summarization · Multimodal Machine Learning Applications
