TubeR: Tubelet Transformer for Video Action Detection
Jiaojiao Zhao, Yanyi Zhang, Xinyu Li, Hao Chen, Shuai Bing, Mingze Xu, Chunhui Liu, Kaustav Kundu, Yuanjun Xiong, Davide Modolo, Ivan Marsic, Cees G.M. Snoek, Joseph Tighe

TL;DR
TubeR introduces a novel end-to-end transformer-based approach for video action detection that models spatio-temporal dynamics directly, outperforming previous methods on standard benchmarks.
Contribution
It presents a simple, unified model that detects action tubelets without relying on actor detectors or hand-crafted proposals, using tubelet-queries and attention mechanisms.
Findings
Outperforms the previous state of the art on the AVA, UCF101-24, and JHMDB51-21 datasets.
Effectively models dynamic spatio-temporal features with tubelet-attention.
Maintains good performance on long video clips.
Abstract
We propose TubeR: a simple solution for spatio-temporal video action detection. Different from existing methods that depend on either an off-line actor detector or hand-designed actor-positional hypotheses like proposals or anchors, we propose to directly detect an action tubelet in a video by simultaneously performing action localization and recognition from a single representation. TubeR learns a set of tubelet-queries and utilizes a tubelet-attention module to model the dynamic spatio-temporal nature of a video clip, which effectively reinforces the model capacity compared to using actor-positional hypotheses in the spatio-temporal space. For videos containing transitional states or scene changes, we propose a context aware classification head to utilize short-term and long-term context to strengthen action classification, and an action switch regression head for detecting the…
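To make the idea of tubelet-queries and tubelet-attention more concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation. It assumes that each of N tubelet queries holds one D-dimensional embedding per frame for T frames, and that attention is applied in two steps: across queries within a frame, then across frames within a tubelet. The function names, the shapes, and the omission of learned projections are all illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    """Scaled dot-product self-attention over the second-to-last axis.

    x has shape (..., L, D). For brevity, x serves as query, key, and value;
    a real model would apply learned linear projections first (assumption).
    """
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)  # (..., L, L)
    return softmax(scores, axis=-1) @ x               # (..., L, D)


def tubelet_attention(queries):
    """Hypothetical two-step attention over tubelet queries.

    queries has shape (N, T, D): N tubelet queries, T frames, D channels.
    """
    # Step 1: within each frame, attend across the N queries.
    per_frame = np.swapaxes(queries, 0, 1)                   # (T, N, D)
    spatial = np.swapaxes(self_attention(per_frame), 0, 1)   # (N, T, D)
    # Step 2: within each tubelet, attend across its T frames.
    return self_attention(spatial)                           # (N, T, D)


rng = np.random.default_rng(0)
N, T, D = 5, 8, 16  # illustrative sizes, not taken from the paper
queries = rng.normal(size=(N, T, D))
out = tubelet_attention(queries)
print(out.shape)  # (5, 8, 16)
```

The output keeps one embedding per query per frame, so a downstream head could, under the same assumptions, regress one box per frame and classify the tubelet as a whole.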
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications
Methods: Attentive Walk-Aggregating Graph Neural Network
