MCAD: Multimodal Context-Aware Audio Description Generation For Soccer

Lipisha Chaudhary; Trisha Mittal; Subhadra Gopalakrishnan; Ifeoma Nwogu; Jaclyn Pytlarz

arXiv:2511.09448 · cs.MM · November 13, 2025

TL;DR

This paper introduces MCAD, an end-to-end system that generates audio descriptions for soccer videos by leveraging multimodal context and fine-tuned large language models, extending AD beyond movies to sports.

Contribution

The work presents a novel pipeline for soccer AD generation without relying on ground truth AD, including a new evaluation metric and a dataset of annotated soccer clips.

Findings

01. MCAD effectively generates context-aware AD for soccer videos.

02. The ARGE-AD metric accurately assesses AD quality across domains.

03. The approach outperforms baseline methods in descriptive accuracy.

Abstract

Audio Descriptions (AD) are essential for making visual content accessible to individuals with visual impairments. Recent work has taken promising steps toward automating AD, but it has been limited to high-quality movie content and relies on human-annotated ground-truth AD. In this work, we present an end-to-end pipeline, MCAD, that extends AD generation beyond movies to sports, with a focus on soccer games, without relying on ground-truth AD. To address the absence of domain-specific AD datasets, we fine-tune a Video Large Language Model on publicly available movie AD datasets so that it learns the narrative structure and conventions of AD. During inference, MCAD incorporates multimodal contextual cues such as player identities, soccer events and actions, and commentary from the game. These cues, combined with input prompts to the fine-tuned…
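The abstract names three kinds of contextual cue (player identities, soccer events and actions, and game commentary) that are combined with an input prompt at inference time. The sketch below illustrates one way such cues could be folded into a text prompt. It is not the authors' implementation: the `ClipContext` class, the `build_prompt` function, the field names, and the prompt layout are all hypothetical, since the abstract does not specify them.

```python
from dataclasses import dataclass, field


@dataclass
class ClipContext:
    """Hypothetical container for the contextual cues named in the abstract."""
    players: list[str] = field(default_factory=list)  # recognized player identities
    events: list[str] = field(default_factory=list)   # detected soccer events/actions
    commentary: str = ""                               # commentary text for the clip


def build_prompt(ctx: ClipContext, instruction: str) -> str:
    """Join the instruction and any non-empty cues into one prompt string.

    The "Players:/Events:/Commentary:" layout is an assumption made for
    illustration; the paper's actual prompt format is not given here.
    """
    parts = [instruction]
    if ctx.players:
        parts.append("Players: " + ", ".join(ctx.players))
    if ctx.events:
        parts.append("Events: " + ", ".join(ctx.events))
    if ctx.commentary:
        parts.append("Commentary: " + ctx.commentary)
    return "\n".join(parts)


if __name__ == "__main__":
    ctx = ClipContext(players=["Player A"], events=["corner kick"])
    print(build_prompt(ctx, "Describe the clip."))
```

Skipping empty cues keeps the prompt short when a detector returns nothing for a clip, which is one plausible design choice among several.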

Taxonomy

Topics: Subtitles and Audiovisual Media · Video Analysis and Summarization · Multimodal Machine Learning Applications