MALT: Multi-scale Action Learning Transformer for Online Action Detection
Zhipeng Yang, Ruoyu Wang, Yang Tan, Liping Xie

TL;DR
This paper introduces MALT, a multi-scale transformer model for real-time online action detection that effectively captures action features at various granularities and employs an efficient frame scoring mechanism.
Contribution
MALT is a novel multi-scale transformer with a recurrent decoder and hierarchical encoder, improving real-time action detection by capturing multi-scale features and filtering irrelevant frames efficiently.
Findings
Achieved state-of-the-art performance on THUMOS'14, outperforming the compared models by 0.2% mAP.
Outperformed existing models on TVSeries by 0.1% mcAP.
Efficient training with fewer parameters due to the recurrent decoder.
Abstract
Online action detection (OAD) aims to identify ongoing actions from streaming video in real time, without access to future frames. Because these actions manifest at varying scales of granularity, ranging from coarse to fine, projecting an entire set of action frames to a single latent encoding may lose local information, so action features need to be acquired across multiple scales. In this paper, we propose a multi-scale action learning transformer (MALT), which includes a novel recurrent decoder (used for feature fusion) that has fewer parameters and can be trained more efficiently. A hierarchical encoder with multiple encoding branches is further proposed to capture multi-scale action features. The output from the preceding branch is then incrementally input to the subsequent branch as part of a cross-attention calculation. In this way, output features…
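The branch-to-branch cross-attention described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the average pooling, the window sizes, the coarse-to-fine ordering, the residual addition, and the absence of learned projections are all assumptions made for illustration, and the function names (`pool`, `cross_attention`, `hierarchical_encode`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product attention with identity projections (assumption:
    a real model would use learned query/key/value weights)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, context)) for i in range(dim)])
    return out


def pool(frames, window):
    """Average non-overlapping windows of frames into coarser tokens
    (assumption: the paper may form its scales differently)."""
    dim = len(frames[0])
    tokens = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        tokens.append([sum(f[i] for f in chunk) / len(chunk) for i in range(dim)])
    return tokens


def hierarchical_encode(frames, windows=(4, 2, 1)):
    """Run one branch per scale; each branch after the first attends to the
    output of the preceding branch and adds the result to its own tokens."""
    outputs = []
    previous = None
    for window in windows:
        tokens = pool(frames, window)
        if previous is not None:
            attended = cross_attention(tokens, previous)
            tokens = [[t + a for t, a in zip(tok, att)]
                      for tok, att in zip(tokens, attended)]
        outputs.append(tokens)
        previous = tokens
    return outputs


# Eight toy frames with two-dimensional features.
frames = [[float(t), 1.0] for t in range(8)]
outputs = hierarchical_encode(frames)
print([len(branch) for branch in outputs])  # [2, 4, 8]
```

Each branch keeps its own temporal resolution while drawing on the preceding branch's output, which is the property the abstract attributes to the hierarchical encoder. The recurrent decoder and the frame scoring mechanism are not sketched here.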
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Video Surveillance and Tracking Methods
Methods: Sparse Evolutionary Training
