A Semantic and Motion-Aware Spatiotemporal Transformer Network for Action Detection

Matthew Korban; Peter Youngs; Scott T. Acton

arXiv:2405.08204·cs.CV·May 15, 2024

TL;DR

This paper introduces a spatiotemporal transformer network that models both semantic and motion features to detect actions in untrimmed videos, and reports results that outperform current state-of-the-art methods.

Contribution

The paper proposes a new transformer-based model with semantic attention, motion-aware encoding, and sequence-based temporal attention for enhanced action detection.

Findings

01 Outperforms state-of-the-art on AVA 2.2, AVA 2.1, UCF101-24, and EPIC-Kitchens datasets.

02 Effectively models spatiotemporal interactions and dynamic variations in videos.

03 Introduces a motion-aware positional encoding and sequence-based temporal attention mechanism.

Abstract

This paper presents a novel spatiotemporal transformer network that introduces several original components to detect actions in untrimmed videos. First, the multi-feature selective semantic attention model calculates the correlations between spatial and motion features to model spatiotemporal interactions between different action semantics properly. Second, the motion-aware network encodes the locations of action semantics in video frames utilizing the motion-aware 2D positional encoding algorithm. Such a motion-aware mechanism memorizes the dynamic spatiotemporal variations in action frames that current methods cannot exploit. Third, the sequence-based temporal attention model captures the heterogeneous temporal dependencies in action frames. In contrast to standard temporal attention used in natural language processing, primarily aimed at finding similarities between linguistic words,…
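
The abstract describes computing correlations between spatial and motion features. As a rough illustration of that general idea only, the sketch below uses generic scaled dot-product cross-attention in NumPy, with queries taken from spatial features and keys/values from motion features. It is not the paper's multi-feature selective semantic attention model: the function name `cross_attention`, the feature shapes, and the absence of learned projections are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(spatial, motion):
    """Generic scaled dot-product cross-attention (hypothetical helper).

    spatial: (N, d) array of spatial feature vectors, used as queries.
    motion:  (M, d) array of motion feature vectors, used as keys and values.
    Returns the attended features (N, d) and the attention weights (N, M).
    No learned projection matrices are applied; this is an assumption
    made to keep the sketch minimal.
    """
    d = spatial.shape[-1]
    scores = spatial @ motion.T / np.sqrt(d)  # (N, M) pairwise correlations
    weights = softmax(scores, axis=-1)        # each row sums to 1
    attended = weights @ motion               # (N, d) motion-informed features
    return attended, weights


# Toy inputs: 4 spatial tokens and 6 motion tokens, each of dimension 8.
rng = np.random.default_rng(0)
spatial = rng.standard_normal((4, 8))
motion = rng.standard_normal((6, 8))
attended, weights = cross_attention(spatial, motion)
```

Each row of `weights` indicates how strongly one spatial token attends to each motion token; a trained model would typically add learned query, key, and value projections and multiple heads.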
