Pixels or Positions? Benchmarking Modalities in Group Activity Recognition
Drishya Karki, Merey Ramazanova, Anthony Cioppa, Silvio Giancola, Bernard Ghanem

TL;DR
This paper introduces SoccerNet-GAR, a benchmark dataset for comparing video and tracking modalities in group activity recognition, revealing tracking's superior efficiency and accuracy in sports contexts.
Contribution
The work provides a new multimodal dataset, a unified evaluation protocol, and a novel role-aware graph neural network architecture for tracking-based group activity recognition.
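This summary does not describe the architecture itself, so the following is only a minimal sketch of what a role-aware graph over tracking data could look like, not the authors' model. It assumes each agent has a 2D pitch position and a role label, builds node features by appending a one-hot role encoding to the position, connects agents within a distance threshold, and applies one mean-aggregation message-passing step. The role vocabulary `ROLES`, the `radius` value, and both function names are hypothetical.

```python
import numpy as np

# Hypothetical role vocabulary; the actual roles used in the paper are not given here.
ROLES = ["goalkeeper", "defender", "midfielder", "forward", "ball"]


def build_role_aware_graph(positions, roles, radius=15.0):
    """Build node features and an adjacency matrix for one tracking frame.

    positions: (N, 2) array-like of pitch coordinates (assumed to be in metres).
    roles:     list of N role names drawn from ROLES.
    radius:    assumed distance threshold for connecting two agents.
    """
    positions = np.asarray(positions, dtype=float)

    # One-hot encode each agent's role so the graph is "role-aware".
    one_hot = np.zeros((len(roles), len(ROLES)))
    for i, role in enumerate(roles):
        one_hot[i, ROLES.index(role)] = 1.0

    # Node feature = [x, y, one-hot role].
    features = np.concatenate([positions, one_hot], axis=1)

    # Pairwise Euclidean distances; agents within `radius` share an edge.
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    adjacency = (dist <= radius).astype(float)  # diagonal is 1 (self-loops)

    return features, adjacency


def mean_aggregate(features, adjacency):
    """One message-passing step: average features over each node's neighbours."""
    degree = adjacency.sum(axis=1, keepdims=True)  # >= 1 because of self-loops
    return adjacency @ features / degree


if __name__ == "__main__":
    feats, adj = build_role_aware_graph(
        positions=[[0.0, 0.0], [10.0, 0.0], [50.0, 0.0]],
        roles=["goalkeeper", "defender", "ball"],
    )
    print(mean_aggregate(feats, adj))
```

A real model would stack learned layers over many frames; the point here is only that tracking input is a small array per frame (a handful of numbers per agent), which is consistent with the compactness the paper attributes to this modality.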
Findings
Tracking-based classifiers outperform video-based ones in accuracy.
Tracking models require significantly less training time and fewer parameters.
The dataset enables standardized comparison of modalities in sports group activity recognition.
Abstract
Group Activity Recognition (GAR) is well studied on the video modality for surveillance and indoor team sports (e.g., volleyball, basketball). Yet other modalities, such as agent positions and trajectories over time (i.e., tracking), remain comparatively under-explored despite being compact, agent-centric signals that explicitly encode spatial interactions. Understanding whether pixel (video) or position (tracking) modalities lead to better group activity recognition is therefore important to drive further research on the topic. However, no standardized benchmark currently exists that aligns broadcast video and tracking data for the same group activities, leading to a lack of apples-to-apples comparison between these modalities for GAR. In this work, we introduce SoccerNet-GAR, a multimodal dataset built from the matches of the football World Cup 2022. Specifically, the broadcast…
