An Effective-Efficient Approach for Dense Multi-Label Action Detection
Faegheh Sardari, Armin Mustafa, Philip J. B. Jackson, Adrian Hilton

TL;DR
This paper introduces a novel transformer-based network for dense multi-label action detection that preserves temporal positional information and models action co-occurrence efficiently, achieving state-of-the-art results.
Contribution
It proposes a non-hierarchical transformer structure with relative positional encoding and a new learning paradigm for efficient co-occurrence modeling.
Findings
Improved accuracy on dense multi-label benchmarks.
Effective preservation of temporal positional information.
Enhanced modeling of action co-occurrence relationships.
Abstract
Unlike the sparse-label action detection task, where a single action occurs at each timestamp of a video, the dense multi-label scenario allows actions to overlap. To address this challenging task, it is necessary to simultaneously learn (i) temporal dependencies and (ii) co-occurrence action relationships. Recent approaches model temporal information by extracting multi-scale features through hierarchical transformer-based networks. However, the self-attention mechanism in transformers inherently loses temporal positional information. We argue that combining this with multiple sub-sampling processes in hierarchical designs can lead to further loss of positional information. Preserving this information is essential for accurate action detection. In this paper, we address this issue by proposing a novel transformer-based network that (a) employs a non-hierarchical structure when modelling…
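The abstract's claim that self-attention loses temporal positional information can be illustrated with a small, self-contained sketch. This is not the authors' network: it is a single attention head with identity query/key/value projections, and the names `self_attention`, `rel_bias`, and the distance-based bias values are assumptions chosen only for illustration. It shows that shuffling the input frames merely shuffles the outputs when no positional signal is present, and that adding a bias indexed by the relative offset between frames makes the output depend on frame order.

```python
import math
import random


def softmax(row):
    # Numerically stable softmax over one row of attention logits.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, rel_bias=None):
    """Single-head attention with identity Q/K/V projections (a simplification).

    x: list of T feature vectors of equal length.
    rel_bias: optional dict mapping the offset (j - i) to a scalar that is
              added to the attention logit (hypothetical illustrative bias).
    """
    t, d = len(x), len(x[0])
    out = []
    for i in range(t):
        logits = []
        for j in range(t):
            score = sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            if rel_bias is not None:
                score += rel_bias[j - i]
            logits.append(score)
        w = softmax(logits)
        out.append([sum(w[j] * x[j][k] for j in range(t)) for k in range(d)])
    return out


def max_abs_diff(a, b):
    # Largest element-wise difference between two lists of vectors.
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


random.seed(0)
T, D = 6, 4
x = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
perm = [3, 0, 5, 1, 4, 2]
x_perm = [x[p] for p in perm]

# No positional signal: permuting the frames only permutes the outputs,
# so the gap is zero up to floating-point rounding.
plain = self_attention(x)
plain_perm = self_attention(x_perm)
gap_plain = max_abs_diff([plain[p] for p in perm], plain_perm)

# Relative bias that favours nearby frames: frame order now changes the result.
bias = {off: -0.5 * abs(off) for off in range(-(T - 1), T)}
biased = self_attention(x, bias)
biased_perm = self_attention(x_perm, bias)
gap_biased = max_abs_diff([biased[p] for p in perm], biased_perm)
```

Comparing `gap_plain` with `gap_biased` makes the point: the first is zero up to rounding, while the second is clearly non-zero, because the biased logits depend on how far apart two frames are in time.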
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Video Analysis and Summarization · Human Pose and Action Recognition
