A Semantic and Motion-Aware Spatiotemporal Transformer Network for Action Detection
Matthew Korban, Peter Youngs, Scott T. Acton

TL;DR
This paper introduces a novel spatiotemporal transformer network that effectively models semantic and motion features for improved action detection in untrimmed videos, outperforming current state-of-the-art methods.
Contribution
The paper proposes a new transformer-based model with semantic attention, motion-aware encoding, and sequence-based temporal attention for enhanced action detection.
Findings
Outperforms state-of-the-art methods on the AVA 2.2, AVA 2.1, UCF101-24, and EPIC-Kitchens datasets.
Effectively models spatiotemporal interactions and dynamic variations in videos.
Introduces a motion-aware positional encoding and sequence-based temporal attention mechanism.
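The semantic attention summarized above computes correlations between spatial and motion features. As a rough illustration of that general idea only, the sketch below uses standard scaled dot-product cross-attention, with spatial features as queries and motion features as keys and values. It is not the paper's model: the function names, the feature dimension, and the toy inputs are all assumptions made for this example, and learned projections are omitted for brevity.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(spatial, motion):
    """Generic scaled dot-product cross-attention (hypothetical helper).

    spatial: list of query vectors (one per spatial token), each of length d
    motion:  list of key/value vectors (one per motion token), each of length d
    Returns (attended, weights), where weights[i][j] is how strongly
    spatial token i attends to motion token j.
    """
    d = len(spatial[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in spatial:
        # Correlation (dot product) between one spatial query and every motion key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in motion]
        weights.append(softmax(scores))
    # Each output vector is the attention-weighted sum of the motion vectors.
    attended = [
        [sum(w * v[j] for w, v in zip(row, motion)) for j in range(d)]
        for row in weights
    ]
    return attended, weights


# Toy inputs (assumed, not from the paper): 2 spatial tokens, 3 motion tokens, d = 2.
spatial_feats = [[1.0, 0.0], [0.0, 1.0]]
motion_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, weights = cross_attention(spatial_feats, motion_feats)
```

Each row of `weights` sums to 1, so every spatial token receives a convex combination of the motion vectors. A real model would typically add learned query, key, and value projections and multiple heads.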
Abstract
This paper presents a novel spatiotemporal transformer network that introduces several original components to detect actions in untrimmed videos. First, the multi-feature selective semantic attention model calculates the correlations between spatial and motion features to model spatiotemporal interactions between different action semantics properly. Second, the motion-aware network encodes the locations of action semantics in video frames utilizing the motion-aware 2D positional encoding algorithm. Such a motion-aware mechanism memorizes the dynamic spatiotemporal variations in action frames that current methods cannot exploit. Third, the sequence-based temporal attention model captures the heterogeneous temporal dependencies in action frames. In contrast to standard temporal attention used in natural language processing, primarily aimed at finding similarities between linguistic words,…
