Learning to Refactor Action and Co-occurrence Features for Temporal Action Localization
Kun Xia, Le Wang, Sanping Zhou, Nanning Zheng, Wei Tang

TL;DR
This paper introduces RefactorNet, a novel approach that decouples and recombines action and co-occurrence features in videos to improve the accuracy of temporal action localization.
Contribution
The paper proposes a new feature decoupling and recombination method that enhances action localization by emphasizing salient action content.
Findings
Significant performance improvements on the THUMOS14 and ActivityNet v1.3 datasets.
Effective decoupling of action and co-occurrence features.
Improved localization accuracy with a simple detector.
Abstract
The main challenge of Temporal Action Localization is to retrieve subtle human actions from various co-occurring ingredients, e.g., context and background, in an untrimmed video. While prior approaches have achieved substantial progress through devising advanced action detectors, they still suffer from these co-occurring ingredients which often dominate the actual action content in videos. In this paper, we explore two orthogonal but complementary aspects of a video snippet, i.e., the action features and the co-occurrence features. Especially, we develop a novel auxiliary task by decoupling these two types of features within a video snippet and recombining them to generate a new feature representation with more salient action information for accurate action localization. We term our method RefactorNet, which first explicitly factorizes the action content and regularizes its…
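The decouple-then-recombine idea in the abstract can be illustrated with a small NumPy sketch. This is not the RefactorNet architecture: the paper learns its decomposition, whereas here the two components come from a fixed random orthogonal projector and its complement. The names `decouple`, `recombine`, `w_action`, `w_cooc`, and `action_weight` are all hypothetical, and the additive recombination rule is an assumption chosen only to show how the action component could be made more salient.

```python
import numpy as np


def decouple(x, w_action, w_cooc):
    """Split snippet features into two components using (hypothetical) linear maps."""
    return x @ w_action, x @ w_cooc


def recombine(action, cooc, action_weight=2.0):
    """Re-mix the components, up-weighting the action part (assumed rule)."""
    return action_weight * action + cooc


rng = np.random.default_rng(0)
num_snippets, dim = 8, 16
x = rng.standard_normal((num_snippets, dim))  # stand-in snippet features

# Stand-in for a learned decomposition: an orthogonal projector onto a random
# half-dimensional subspace, plus its complement (w_action + w_cooc = identity).
q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
basis = q[:, : dim // 2]
w_action = basis @ basis.T
w_cooc = np.eye(dim) - w_action

action, cooc = decouple(x, w_action, w_cooc)
refactored = recombine(action, cooc)  # same shape as x, action part doubled
```

With this choice of projectors the two components are orthogonal for every snippet and sum back to the original feature, so setting `action_weight=1.0` reproduces the input exactly; larger weights shift the representation toward the action component.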
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Anomaly Detection Techniques and Applications
