TL;DR
This paper presents a method for recognizing group activities from skeletal data without individual action labels. It uses pseudo-labels computed from pre-trained features and a lightweight pose-only architecture, and achieves results competitive with more complex multimodal approaches.
Contribution
It introduces a novel approach to train group activity recognition models solely with sequence-level labels and pseudo-labels, reducing the need for detailed annotations.
Findings
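Models trained without any individual action supervision perform poorly.
Pseudo-labels computed from any pre-trained feature extractor yield comparable final performance.
A lean pose-only architecture is highly competitive with more complex multimodal models, even in its self-supervised variant.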
Abstract
To understand human behavior, we must not only recognize individual actions but also model possibly complex group activities and interactions. Hierarchical models obtain the best results in group activity recognition but require fine-grained individual action annotations at the actor level. In this paper we show that, from skeletal data alone, we can train a state-of-the-art end-to-end system using only group activity labels at the sequence level. Our experiments show that models trained without individual action supervision perform poorly. On the other hand, we show that pseudo-labels can be computed from any pre-trained feature extractor with comparable final performance. Finally, our carefully designed, lean, pose-only architecture shows highly competitive results versus more complex multimodal approaches, even in the self-supervised variant.
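The abstract does not say how pseudo-labels are derived from the pre-trained features. One common way to do this, shown here purely as an illustrative assumption and not as the authors' method, is to cluster per-actor feature vectors and use the cluster index as a stand-in for the missing individual action label. The function name `kmeans_pseudo_labels`, its parameters, and the toy features below are all hypothetical.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Vector, centroids: List[List[float]]) -> int:
    """Index of the centroid closest to `point` (first one wins on ties)."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(point, centroids[k]))


def kmeans_pseudo_labels(features: Sequence[Vector],
                         num_clusters: int,
                         num_iters: int = 20,
                         seed: int = 0) -> List[int]:
    """Assign each feature vector a pseudo-label via plain k-means.

    ASSUMPTION: clustering is only one possible way to obtain pseudo-labels;
    the source does not specify the procedure actually used.
    """
    if not 0 < num_clusters <= len(features):
        raise ValueError("num_clusters must be in [1, len(features)]")
    rng = random.Random(seed)
    # Initialise centroids from randomly chosen feature vectors.
    centroids = [list(f) for f in rng.sample(list(features), num_clusters)]
    for _ in range(num_iters):
        labels = [nearest(f, centroids) for f in features]
        for k in range(num_clusters):
            members = [f for f, lab in zip(features, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members)
                                for col in zip(*members)]
    # Final assignment against the last set of centroids.
    return [nearest(f, centroids) for f in features]


# Toy per-actor features, standing in for pre-trained extractor outputs.
toy_features = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]]
pseudo_labels = kmeans_pseudo_labels(toy_features, num_clusters=2)
```

In a setup like this, the resulting cluster indices would take the place of actor-level action annotations as an auxiliary training target, while the group activity label at the sequence level remains the only human-provided supervision.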
