Deep soccer captioning with transformer: dataset, semantics-related losses, and multi-level evaluation

Ahmad Hammoudeh; Bastien Vanderplaetse; Stéphane Dupont

arXiv:2202.05728 · cs.CV · December 1, 2022 · 1 citation

Open Access

TL;DR

This paper introduces a new dataset, a multi-component transformer-based model, and a comprehensive three-level evaluation for generating descriptive captions for soccer videos, significantly improving caption diversity and accuracy.

Contribution

It presents a novel dataset, a multi-modal transformer model, and a triple-level evaluation framework for soccer video captioning, advancing the state-of-the-art in semantic and diversity metrics.

Findings

01. Semantics-related losses increased caption diversity from 0.07 to 0.18.

02. Using multiple visual features improved normalized captioning scores by 28%.

03. The model effectively integrates language and vision for detailed soccer video descriptions.
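
The corpus-level diversity figure in the first finding can be illustrated with a minimal sketch. The page does not give the paper's exact definition of diversity, so the code below assumes one simple reading: the fraction of distinct captions among all generated captions. The function name `caption_diversity` and the sample captions are hypothetical.

```python
def caption_diversity(captions):
    """Fraction of distinct captions among all generated captions.

    Assumed definition for illustration; the paper's metric may differ.
    """
    if not captions:
        return 0.0
    # Normalize case and whitespace so trivially different strings
    # count as the same caption.
    normalized = [" ".join(c.lower().split()) for c in captions]
    return len(set(normalized)) / len(normalized)


# Hypothetical model outputs: three of the four captions are the same sentence.
generated = [
    "A player takes a corner kick",
    "a player takes a corner kick",
    "The goalkeeper saves the shot",
    "A player takes a  corner kick",
]
score = caption_diversity(generated)
print(f"diversity = {score:.2f}")
```

Under this reading, a score near 0 means the model repeats a handful of captions across the corpus, and a score near 1 means almost every generated caption is unique.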

Abstract

This work aims at generating captions for soccer videos using deep learning. In this context, this paper introduces a dataset, model, and triple-level evaluation. The dataset consists of 22k caption-clip pairs and three visual features (images, optical flow, inpainting) for ~500 hours of SoccerNet videos. The model is divided into three parts: a transformer learns language, ConvNets learn vision, and a fusion of linguistic and visual features generates captions. The paper suggests evaluating generated captions at three levels: syntax (the commonly used evaluation metrics such as BLEU-score and CIDEr), meaning (the quality of descriptions for a domain expert), and corpus (the diversity of generated captions). The paper shows that the diversity of generated captions has improved (from 0.07 reaching 0.18) with semantics-related losses that prioritize selected words.…
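
The abstract describes the semantics-related losses only as losses "that prioritize selected words". One common way to do that is a token-weighted cross-entropy, sketched below in plain Python. This is an assumption about the general idea, not the paper's formulation: the function name `weighted_cross_entropy`, the weight value, and the toy probabilities are all hypothetical.

```python
import math


def weighted_cross_entropy(probs, targets, word_weights, default_weight=1.0):
    """Weighted mean negative log-likelihood over one caption.

    probs: list of dicts mapping word -> predicted probability, one per step.
    targets: list of reference words, one per step.
    word_weights: dict mapping prioritized words to their weights
                  (hypothetical values; other words get default_weight).
    """
    total, weight_sum = 0.0, 0.0
    for step_probs, word in zip(probs, targets):
        w = word_weights.get(word, default_weight)
        # Clamp to avoid log(0) when the model assigns no mass to the target.
        p = max(step_probs.get(word, 0.0), 1e-12)
        total += -w * math.log(p)
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


# Toy two-step caption: the domain word "goal" is predicted poorly.
probs = [{"a": 0.5, "goal": 0.5}, {"a": 0.75, "goal": 0.25}]
targets = ["a", "goal"]

unweighted = weighted_cross_entropy(probs, targets, {})
weighted = weighted_cross_entropy(probs, targets, {"goal": 3.0})
print(f"unweighted = {unweighted:.4f}, weighted = {weighted:.4f}")
```

With the higher weight on "goal", the poorly predicted domain word contributes more to the loss, so training pressure shifts toward the selected vocabulary rather than frequent filler words.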

Taxonomy

Topics: Subtitles and Audiovisual Media · Video Analysis and Summarization · Multimodal Machine Learning Applications