TL;DR
This paper introduces a graph-based framework for video action detection that models high-level spatio-temporal interactions between people and objects. It applies self-attention on a multi-layer graph to capture long-range dependencies across consecutive clips, and reports state-of-the-art results on the AVA dataset.
Contribution
It proposes a backbone-independent, graph-based module that learns spatio-temporal relationships for video action detection and does not require end-to-end training.
Findings
State-of-the-art results on the AVA dataset
Consistent improvements over various backbones
Effective modeling of long-range spatio-temporal dependencies
Abstract
Action detection is a complex task that aims to detect and classify human actions in video clips. Typically, it has been addressed by processing fine-grained features extracted from a video classification backbone. Recently, thanks to the robustness of object and people detectors, greater attention has been paid to relationship modelling. Following this line, we propose a graph-based framework to learn high-level interactions between people and objects, in both space and time. In our formulation, spatio-temporal relationships are learned through self-attention on a multi-layer graph structure that can connect entities from consecutive clips, thus considering long-range spatial and temporal dependencies. The proposed module is backbone independent by design and does not require end-to-end training. Extensive experiments are conducted on the AVA dataset, where our model demonstrates…
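The mechanism described in the abstract, self-attention restricted to a graph whose edges link entities from the same or consecutive clips, can be illustrated with a minimal sketch. This is not the authors' implementation. The edge rule (`|clip_i - clip_j| <= 1`), the single attention head, the random projection matrices, and all function names (`build_adjacency`, `graph_self_attention`, `random_matrix`) are assumptions made for illustration, and the node features stand in for person and object features that would come from a pre-trained backbone and detector.

```python
import math
import random


def build_adjacency(clip_ids):
    """Assumed edge rule: link entities in the same clip or in consecutive clips."""
    n = len(clip_ids)
    return [[abs(clip_ids[i] - clip_ids[j]) <= 1 for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def graph_self_attention(x, adj, w_q, w_k, w_v):
    """Single-head scaled dot-product attention, masked so that each node
    attends only to its graph neighbours (self-loops included)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    weights = []
    for i, q_i in enumerate(q):
        # Scores are computed only for neighbours of node i.
        scores = {
            j: sum(a * b for a, b in zip(q_i, k_j)) / scale
            for j, k_j in enumerate(k)
            if adj[i][j]
        }
        top = max(scores.values())  # subtract the max for numerical stability
        exp = {j: math.exp(s - top) for j, s in scores.items()}
        total = sum(exp.values())
        weights.append([exp.get(j, 0.0) / total for j in range(len(k))])
    return matmul(weights, v), weights


def random_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
clip_ids = [0, 0, 1, 1, 2]  # hypothetical: five detected entities over three clips
dim = 4                     # hypothetical feature size
x = random_matrix(len(clip_ids), dim)  # stand-in for frozen backbone features
w_q, w_k, w_v = (random_matrix(dim, dim) for _ in range(3))

adj = build_adjacency(clip_ids)
out, weights = graph_self_attention(x, adj, w_q, w_k, w_v)
```

In this sketch a single layer gives an entity in clip 0 zero attention weight on an entity in clip 2, because no edge links them. Stacking a second layer lets information pass through the clip 1 entities, which is one way to read the abstract's claim that a multi-layer graph reaches long-range temporal dependencies. Because the features in `x` are fixed inputs, a module of this kind could be trained on its own, in line with the stated backbone independence.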
