SoccerLens: Grounded Soccer Video Understanding Beyond Accuracy
Ismael Elsharkawi, Ahmed Sait, Silvio Giancola, Bernard Ghanem, Hossam Sharara, Abdelrahman Eldesokey

TL;DR
SoccerLens introduces a benchmark for evaluating whether soccer video understanding models ground their predictions in visual cues, revealing that current models lack true visual grounding despite high classification accuracy.
Contribution
The paper presents SoccerLens, a grounded evaluation benchmark with annotated soccer videos and extended attribution methods to assess visual grounding beyond accuracy.
Findings
Current models achieve over 50% grounding performance only under loose cues.
Models underutilize temporal information in soccer videos.
There is a significant gap between classification accuracy and true visual grounding.
Abstract
Vision-language models (VLMs) have recently shown strong potential in soccer video understanding. However, given the high complexity of soccer videos due to large viewpoint variations, rapid shot transitions, and cluttered scenes, it remains unclear whether VLMs rely on meaningful visual evidence or exploit spurious correlations and shortcut learning. Existing evaluation protocols focus primarily on classification accuracy and do not assess visual grounding. To address this limitation, we introduce SoccerLens, a benchmark for grounded soccer video understanding. The benchmark contains annotated video segments spanning common soccer events, with structured visual cues organized into three levels of semantic relevance. We further extend the attribution method of Chefer et al. [arXiv:2103.15679] to jointly model spatial and temporal attention, and introduce evaluation metrics that measure…
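To make the idea of attribution over space and time concrete, here is a minimal sketch and not the paper's implementation. It applies the self-attention relevance update from Chefer et al. (R ← R + Ā·R, with Ā the head-averaged positive part of gradient times attention) to a token sequence assumed to be one [CLS] token followed by T×H×W space-time patch tokens. It then scores how much of the resulting relevance falls inside an annotated cue mask. The token layout, the function names, and the "mass inside the cue" score are all illustrative assumptions; the abstract above is truncated before it states the benchmark's actual metrics.

```python
import numpy as np

def relevance_rollout(attns, grads):
    """Chefer-style relevance update for self-attention layers.

    attns, grads: lists (one entry per layer) of arrays shaped
    (heads, n_tokens, n_tokens), holding attention maps and their gradients.
    """
    n = attns[0].shape[-1]
    R = np.eye(n)  # each token starts out relevant only to itself
    for A, G in zip(attns, grads):
        # Positive part of gradient * attention, averaged over heads.
        A_bar = np.clip(G * A, 0.0, None).mean(axis=0)
        R = R + A_bar @ R
    return R

def spacetime_map(R, T, H, W):
    """Relevance of every patch token to the [CLS] token, shaped (T, H, W).

    Assumes (hypothetically) that token 0 is [CLS] and the remaining
    T*H*W tokens are ordered frame by frame, row by row.
    """
    return R[0, 1:].reshape(T, H, W)

def mass_in_cue(attr, cue_mask):
    """Fraction of non-negative attribution lying inside a boolean cue mask.

    An illustrative grounding score, not necessarily the benchmark's metric.
    """
    attr = np.clip(attr, 0.0, None)
    total = attr.sum()
    if total == 0:
        return 0.0
    return float(attr[cue_mask].sum() / total)
```

In use, `attns` and `grads` would come from hooks on a model's attention layers, and `cue_mask` would be rasterised from the annotated visual cue for the clip. A score near 1 would mean the model's relevance is concentrated on the annotated evidence in both space and time.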
