Weakly-Supervised Temporal Action Detection for Fine-Grained Videos with Hierarchical Atomic Actions
Zhi Li, Lu He, Huijuan Xu

TL;DR
This paper introduces a weakly-supervised method for fine-grained temporal action detection in videos, modeling actions as combinations of atomic actions discovered via self-supervised clustering, and leveraging hierarchical labels.
Contribution
It proposes a novel hierarchical approach that automatically discovers atomic actions and maps them to fine and coarse labels, enabling accurate detection with limited supervision.
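As a rough illustration of the idea of discovering reusable atomic actions by clustering, the sketch below groups per-snippet feature vectors with plain k-means and summarizes a video as a histogram over the resulting clusters. This is not the authors' implementation: k-means stands in for their self-supervised clustering, the features are synthetic, and the names `kmeans`, `atomic_histogram`, and `num_atoms` are hypothetical.

```python
import numpy as np

def kmeans(features, k, iters=20, seed=0):
    """Plain k-means, used here only as a stand-in for the paper's clustering."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct snippets.
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every snippet to every centroid.
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = features[assign == j]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, dists.argmin(axis=1)

def atomic_histogram(assign, k):
    """Fraction of a video's snippets assigned to each atomic-action cluster."""
    counts = np.bincount(assign, minlength=k).astype(float)
    return counts / counts.sum()

# Synthetic stand-in for snippet features: 60 snippets, 8 dimensions (assumed sizes).
rng = np.random.default_rng(42)
features = rng.normal(size=(60, 8))
num_atoms = 4  # hypothetical number of atomic actions

centroids, assign = kmeans(features, num_atoms)
hist = atomic_histogram(assign, num_atoms)
print(hist)
```

Under this simplification, two fine-grained actions that share most atomic actions would have similar histograms and differ only in a few bins, which is the intuition behind capturing both commonality and individuality.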
Findings
Achieves state-of-the-art results on FineAction and FineGym datasets.
Effectively captures subtle differences between fine-grained actions.
Demonstrates the benefit of hierarchical modeling for weakly-supervised action detection.
Abstract
Action understanding has evolved into the era of fine granularity, as most human behaviors in real life have only minor differences. To detect these fine-grained actions accurately in a label-efficient way, we tackle the problem of weakly-supervised fine-grained temporal action detection in videos for the first time. Without the careful design to capture subtle differences between fine-grained actions, previous weakly-supervised models for general action detection cannot perform well in the fine-grained setting. We propose to model actions as the combinations of reusable atomic actions which are automatically discovered from data through self-supervised clustering, in order to capture the commonality and individuality of fine-grained actions. The learnt atomic actions, represented by visual concepts, are further mapped to fine and coarse action labels leveraging the semantic label…
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Anomaly Detection Techniques and Applications
Methods: Contrastive Language-Image Pre-training
