Learning Latent Spatio-Temporal Compositional Model for Human Action Recognition
Xiaodan Liang, Liang Lin, Liangliang Cao

TL;DR
This paper introduces STAOG (Spatio-Temporal And-Or Graph), a compositional model for human action recognition that captures the latent structure of actions and their spatio-temporal interactions in videos. The model is trained with a weakly supervised learning algorithm.
Contribution
The paper presents a new hierarchical spatio-temporal model with a weakly supervised learning approach for improved action recognition accuracy.
Findings
Outperforms existing methods on challenging datasets.
Effectively handles large intra-class variance.
Models complex spatio-temporal interactions.
Abstract
Action recognition is an important problem in multimedia understanding. This paper addresses this problem by building an expressive compositional action model. We model one action instance in the video with an ensemble of spatio-temporal compositions: a number of discrete temporal anchor frames, each of which is further decomposed into a layout of deformable parts. In this way, our model can identify a Spatio-Temporal And-Or Graph (STAOG) to represent the latent structure of actions, e.g., triple jumping, swinging, and high jumping. The STAOG model comprises four layers: (i) a batch of leaf-nodes at the bottom for detecting various action parts within video patches; (ii) the or-nodes above the leaf-nodes, i.e., switch variables that activate their children leaf-nodes for structural variability; (iii) the and-nodes within an anchor frame for verifying spatial composition; and (iv) the root-node at the top for…
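The lower three layers named in the abstract can be pictured as a small nested data structure. The following is a minimal sketch under stated assumptions, not the authors' implementation: the class names, the `response` field, the use of a maximum at or-nodes, and the use of a plain sum at and-nodes are all illustrative choices, and deformation costs between parts are left out. Because the description of the root-node is cut off in the abstract as shown here, the sketch does not model it; `score_anchor_frames` is only a hypothetical stand-in that adds up per-frame scores.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LeafNode:
    """Bottom layer: one candidate part detector.

    `response` is a hypothetical stand-in for the detector's score
    on a video patch; the paper's actual features are not shown here.
    """
    response: float


@dataclass
class OrNode:
    """Switch variable that activates one of its child leaf-nodes."""
    children: List[LeafNode]

    def best(self) -> Tuple[int, float]:
        # Assumption: the switch picks the highest-scoring child.
        idx = max(range(len(self.children)),
                  key=lambda i: self.children[i].response)
        return idx, self.children[idx].response


@dataclass
class AndNode:
    """Spatial composition of parts within one anchor frame."""
    parts: List[OrNode]

    def score(self) -> float:
        # Assumption: composition score is the sum of the selected parts.
        # Deformation penalties between parts are omitted in this sketch.
        return sum(part.best()[1] for part in self.parts)


def score_anchor_frames(frames: List[AndNode]) -> float:
    """Hypothetical aggregate over anchor frames (not the paper's root-node)."""
    return sum(frame.score() for frame in frames)


# Toy example with two anchor frames.
frame1 = AndNode([OrNode([LeafNode(0.5), LeafNode(1.5)]),
                  OrNode([LeafNode(2.0)])])
frame2 = AndNode([OrNode([LeafNode(-1.0), LeafNode(0.25)])])
total = score_anchor_frames([frame1, frame2])
```

In this toy setup each or-node exposes which child it selected through `best()`, which is the sense in which the switch variables capture structural variability: different videos of the same action may activate different leaf-nodes.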
