Attentive Action and Context Factorization
Yang Wang, Vinh Tran, Gedas Bertasius, Lorenzo Torresani, Minh Hoai

TL;DR
This paper introduces an attention-based method for human action recognition that localizes actions and context in videos without detailed annotations, improving accuracy and interpretability.
Contribution
A novel weakly supervised attentional mechanism that separates human actions from context in videos for recognition tasks.
Findings
Improved action recognition accuracy on multiple datasets.
Enhanced interpretability through localization of actions and context.
Effective separation of action and context factors without detailed annotations.
Abstract
We propose a method for human action recognition, one that can localize the spatiotemporal regions that `define' the actions. This is a challenging task due to the subtlety of human actions in video and the co-occurrence of contextual elements. To address this challenge, we utilize conjugate samples of human actions, which are video clips that are contextually similar to human action samples but do not contain the action. We introduce a novel attentional mechanism that can spatially and temporally separate human actions from the co-occurring contextual factors. The separation of the action and context factors is weakly supervised, eliminating the need for laboriously detailed annotation of these two factors in training samples. Our method can be used to build human action classifiers with higher accuracy and better interpretability. Experiments on several human action recognition…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsHuman Pose and Action Recognition · Multimodal Machine Learning Applications · Anomaly Detection Techniques and Applications
