Beyond Pixels: Leveraging the Language of Soccer to Improve Spatio-Temporal Action Detection in Broadcast Videos

Jeremie Ochin; Raphael Chekroun; Bogdan Stanciulescu; Sotiris Manitsaris

arXiv:2505.09455·cs.CV·February 2, 2026

TL;DR

This paper introduces a Transformer-based encoder-decoder that denoises sequences of player-centric action predictions using game-state context and the "language of soccer", improving spatio-temporal action detection in broadcast videos, particularly in the high-recall, low-precision regime.

Contribution

The paper proposes a denoising sequence transduction method that incorporates game-level context and tactical regularities to improve action detection in soccer broadcast videos.

Findings

01. Improved precision and recall in low-confidence regimes.

02. Enhanced event extraction accuracy from broadcast videos.

03. Leverages soccer's tactical language for better modeling.

Abstract

State-of-the-art spatio-temporal action detection (STAD) methods show promising results for extracting soccer events from broadcast videos. However, when operated in the high-recall, low-precision regime required for exhaustive event coverage in soccer analytics, their lack of contextual understanding becomes apparent: many false positives could be resolved by considering a broader sequence of actions and game-state information. In this work, we address this limitation by reasoning at the game level and improving STAD through the addition of a denoising sequence transduction task. Sequences of noisy, context-free player-centric predictions are processed alongside clean game state information using a Transformer-based encoder-decoder model. By modeling extended temporal context and reasoning jointly over team-level dynamics, our method leverages the "language of soccer" - its tactical…
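To make the "high-recall, low-precision regime" mentioned in the abstract concrete, the following minimal sketch shows how lowering a detector's confidence threshold trades precision for recall. The detections, scores, ground-truth count, and the function name `precision_recall` are hypothetical and chosen only for illustration; they are not data or code from the paper.

```python
from typing import List, Tuple

# Hypothetical detections: (confidence score, whether it matches a real event).
# These values are invented for illustration and are NOT results from the paper.
DETECTIONS: List[Tuple[float, bool]] = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.55, False), (0.40, True), (0.30, False), (0.20, False),
]
# Assumed number of real events; one of them is never detected at any threshold.
TOTAL_GROUND_TRUTH = 5


def precision_recall(
    detections: List[Tuple[float, bool]],
    total_ground_truth: int,
    threshold: float,
) -> Tuple[float, float]:
    """Return (precision, recall) when keeping detections with score >= threshold."""
    kept = [is_tp for score, is_tp in detections if score >= threshold]
    true_positives = sum(kept)
    # By convention, precision is 1.0 when nothing is predicted.
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / total_ground_truth
    return precision, recall


if __name__ == "__main__":
    for threshold in (0.75, 0.25):
        p, r = precision_recall(DETECTIONS, TOTAL_GROUND_TRUTH, threshold)
        print(f"threshold={threshold:.2f}  precision={p:.3f}  recall={r:.3f}")
```

In this toy setting, the lower threshold recovers more of the real events but also admits more false positives. Those extra false positives are the kind of noise that, according to the abstract, the denoising sequence transduction step aims to resolve using broader action sequences and game-state information.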
