Deep soccer captioning with transformer: dataset, semantics-related losses, and multi-level evaluation
Ahmad Hammoudeh, Bastien Vanderplaetse, Stéphane Dupont

TL;DR
This paper introduces a new dataset, a multi-component transformer-based model, and a comprehensive three-level evaluation for generating descriptive captions for soccer videos, significantly improving caption diversity and accuracy.
Contribution
It presents a novel dataset, a multi-modal transformer model, and a triple-level evaluation framework for soccer video captioning, advancing the state-of-the-art in semantic and diversity metrics.
Findings
Semantic losses increased caption diversity from 0.07 to 0.18.
Using multiple visual features improved normalized captioning scores by 28%.
The model effectively integrates language and vision for detailed soccer video descriptions.
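The summary reports diversity scores (0.07 and 0.18) but does not state how the score is computed. As a rough illustration only, one common corpus-level choice is the fraction of distinct captions among all generated captions. The function name `caption_diversity` and the normalization step are assumptions for this sketch, not the paper's definition.

```python
def caption_diversity(captions):
    """Fraction of distinct captions in a generated corpus (assumed metric).

    Captions are lowercased and whitespace-normalized before comparison,
    which is an illustrative choice rather than the paper's procedure.
    """
    if not captions:
        raise ValueError("captions must be non-empty")
    normalized = [" ".join(c.lower().split()) for c in captions]
    return len(set(normalized)) / len(normalized)


# Hypothetical generated captions: three variants of one sentence plus one other.
generated = [
    "Player shoots on goal",
    "player shoots on goal",
    "Corner kick taken from the left",
    "Player shoots  on goal",
]
print(caption_diversity(generated))
```

Under this assumed definition, a model that repeats a handful of template sentences scores near 0, while a model that produces a different caption for every clip scores 1.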
Abstract
This work aims at generating captions for soccer videos using deep learning. To that end, the paper introduces a dataset, a model, and a triple-level evaluation. The dataset consists of 22k caption-clip pairs and three visual features (images, optical flow, inpainting) for ~500 hours of SoccerNet videos. The model is divided into three parts: a transformer learns language, ConvNets learn vision, and a fusion of linguistic and visual features generates captions. The paper suggests evaluating generated captions at three levels: syntax (commonly used evaluation metrics such as BLEU and CIDEr), meaning (the quality of descriptions for a domain expert), and corpus (the diversity of generated captions). The paper shows that the diversity of generated captions improves from 0.07 to 0.18 with semantics-related losses that prioritize selected words.…
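The abstract mentions semantics-related losses that prioritize selected words without giving their form. One plausible reading, sketched below under that assumption, is a cross-entropy in which selected target words carry a larger weight. The function `weighted_cross_entropy`, the weight values, and the toy probabilities are all hypothetical and are not taken from the paper.

```python
import math


def weighted_cross_entropy(probs, targets, word_weights, default_weight=1.0):
    """Weighted-average negative log-likelihood over a caption.

    probs        -- one dict per position, mapping word -> predicted probability
    targets      -- the reference word at each position
    word_weights -- weights for prioritized words (assumed, not from the paper)
    """
    total = 0.0
    weight_sum = 0.0
    for dist, word in zip(probs, targets):
        w = word_weights.get(word, default_weight)
        p = max(dist.get(word, 0.0), 1e-12)  # guard against log(0)
        total += -w * math.log(p)
        weight_sum += w
    return total / weight_sum


# Toy two-word caption: the model is less sure about the content word "goal".
probs = [{"the": 0.5, "a": 0.5}, {"goal": 0.25, "ball": 0.75}]
targets = ["the", "goal"]

plain = weighted_cross_entropy(probs, targets, {})
prioritized = weighted_cross_entropy(probs, targets, {"goal": 3.0})
print(plain, prioritized)
```

With the weight on "goal" raised, the error on that word dominates the loss, so training would push harder on getting such words right than on function words like "the".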
Taxonomy
Topics: Subtitles and Audiovisual Media · Video Analysis and Summarization · Multimodal Machine Learning Applications
