Activity Graph Transformer for Temporal Action Localization
Megha Nawhal, Greg Mori

TL;DR
The paper introduces Activity Graph Transformer, an end-to-end model that reasons over a video as a graph to improve temporal action localization, especially when action instances overlap or recur in a non-linear temporal order.
Contribution
It proposes a graph-based transformer model for directly predicting action instances, addressing limitations of sequential processing in complex video scenarios.
Findings
Outperforms prior state-of-the-art methods on the THUMOS14, Charades, and EPIC-Kitchens-100 datasets, with accuracy gains reported as significant.
Captures non-linear temporal dependencies, such as overlapping or recurring action instances.
Abstract
We introduce Activity Graph Transformer, an end-to-end learnable model for temporal action localization, that receives a video as input and directly predicts a set of action instances that appear in the video. Detecting and localizing action instances in untrimmed videos requires reasoning over multiple action instances in a video. The dominant paradigms in the literature process videos temporally to either propose action regions or directly produce frame-level detections. However, sequential processing of videos is problematic when the action instances have non-sequential dependencies and/or non-linear temporal ordering, such as overlapping action instances or re-occurrence of action instances over the course of the video. In this work, we capture this non-linear temporal structure by reasoning over the videos as non-sequential entities in the form of graphs. We evaluate our model on…
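To make the task concrete, here is a minimal Python sketch, not taken from the paper, of what "a set of action instances" with overlapping and recurring actions can look like. The class name ActionInstance, the helper functions, and the example annotations are illustrative assumptions. Temporal IoU is the interval-overlap measure commonly used to score localization, and it is used here only to detect overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionInstance:
    """One localized action: a label plus start/end times in seconds."""
    label: str
    start: float
    end: float


def temporal_iou(a: ActionInstance, b: ActionInstance) -> float:
    """Intersection-over-union of two time intervals (0.0 when disjoint)."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def overlapping_pairs(instances):
    """Return index pairs (i, j), i < j, whose intervals overlap in time."""
    return [
        (i, j)
        for i in range(len(instances))
        for j in range(i + 1, len(instances))
        if temporal_iou(instances[i], instances[j]) > 0.0
    ]


# Hypothetical annotations for one untrimmed video (not from any dataset).
video = [
    ActionInstance("open fridge", 0.0, 4.0),
    ActionInstance("take milk", 2.0, 6.0),      # overlaps "open fridge"
    ActionInstance("open fridge", 10.0, 12.0),  # same action recurs later
]

print(overlapping_pairs(video))
```

In this toy example the first two instances overlap in time and the first action recurs later, which is the kind of non-linear structure that the abstract says strictly sequential processing handles poorly.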
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Gait Recognition and Analysis
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Laplacian EigenMap · Residual Connection · Dense Connections · Layer Normalization · Attention Is All You Need · Byte Pair Encoding · Label Smoothing
