SimOn: A Simple Framework for Online Temporal Action Localization
Tuan N. Tang, Jungin Park, Kwonyoung Kim, Kwanghoon Sohn

TL;DR
SimOn introduces a simple Transformer-based framework for online temporal action localization that effectively predicts action instances from streaming videos without future frame access, outperforming previous methods on benchmark datasets.
Contribution
The paper presents a novel end-to-end Transformer framework for On-TAL that leverages past visual context and learnable embeddings, setting new state-of-the-art results.
Findings
Outperforms previous methods on THUMOS14 and ActivityNet1.3 datasets.
Achieves new state-of-the-art performance in online temporal action localization.
Demonstrates robustness and effectiveness in online detection of action start.
Abstract
Online Temporal Action Localization (On-TAL) aims to immediately provide action instances from untrimmed streaming videos. The model may use neither future frames nor any processing technique that modifies past predictions, which makes On-TAL much more challenging. In this paper, we propose a simple yet effective framework, termed SimOn, that learns to predict action instances using the popular Transformer architecture in an end-to-end manner. Specifically, the model takes the current frame feature as the query and a set of past context information as the keys and values of the Transformer. Unlike prior work, which uses a set of model outputs as past context, we leverage the past visual context and a learnable context embedding for the current query. Experimental results on the THUMOS14 and ActivityNet1.3 datasets show that our model remarkably outperforms the previous…
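The query/key/value arrangement described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: it uses a single attention head with no learned projections, and it assumes that the context embedding is simply prepended to the stored past frame features to form the keys and values. The names `OnlineAttentionSketch`, `cross_attend`, and `window` are hypothetical, and the context embedding is a fixed zero vector standing in for a learned parameter.

```python
import math
from collections import deque


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, memory):
    """Single-head scaled dot-product attention of one query over memory vectors.

    Keys and values are both the raw memory vectors here; a real Transformer
    would apply learned linear projections first (omitted in this sketch).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in memory]
    weights = softmax(scores)
    output = [
        sum(w * vec[i] for w, vec in zip(weights, memory))
        for i in range(len(query))
    ]
    return output, weights


class OnlineAttentionSketch:
    """Hypothetical streaming wrapper that only ever sees past frames."""

    def __init__(self, dim, window=4):
        # Stand-in for a learned context embedding (assumption: fixed zeros).
        self.context_embedding = [0.0] * dim
        # Bounded buffer of past frame features; old frames fall out.
        self.past_frames = deque(maxlen=window)

    def step(self, frame_feature):
        # Memory = context embedding followed by the stored past frames.
        memory = [self.context_embedding] + list(self.past_frames)
        output, weights = cross_attend(frame_feature, memory)
        # Store the current frame only after attending, so it counts as
        # "past" from the next step onward.
        self.past_frames.append(list(frame_feature))
        return output, weights


if __name__ == "__main__":
    model = OnlineAttentionSketch(dim=2, window=3)
    for feature in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
        output, weights = model.step(feature)
        print(len(weights), [round(w, 3) for w in weights])
```

Because each call to `step` attends only over frames already in the buffer, the sketch respects the online constraint: no future frame is read, and earlier outputs are never revised.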
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Label Smoothing · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Linear Layer · Adam · Absolute Position Encodings · Layer Normalization
