TL;DR
This paper introduces tensor-based representations for action recognition in videos that compactly capture higher-order spatio-temporal relationships and action dynamics. The representations are reported to be effective on 3D skeleton sequences, fine-grained video recognition, and standard videos.
Contribution
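It proposes two tensor representations, the sequence compatibility kernel (SCK) and the dynamics compatibility kernel (DCK), together with SCK(+), a generalization of SCK that operates on subsequences to capture local-global correlations. It also introduces tensor normalization techniques to improve recognition.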
Findings
Effective on 3D skeleton sequences
Improved fine-grained video recognition
Robustness across standard videos
Abstract
Human actions in video sequences are characterized by the complex interplay between spatial features and their temporal dynamics. In this paper, we propose novel tensor representations for compactly capturing such higher-order relationships between visual features for the task of action recognition. We propose two tensor-based feature representations, viz. (i) sequence compatibility kernel (SCK) and (ii) dynamics compatibility kernel (DCK). SCK builds on the spatio-temporal correlations between features, whereas DCK explicitly models the action dynamics of a sequence. We also explore a generalization of SCK, coined SCK(+), that operates on subsequences to capture the local-global interplay of correlations, and which can incorporate multi-modal inputs, e.g., skeleton 3D body-joints and per-frame classifier scores obtained from deep learning models trained on videos. We introduce linearization…
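To make the idea of a "higher-order relationship between features" concrete, the sketch below builds a generic third-order tensor by averaging the outer products of per-frame feature vectors. It is an illustration only and not the SCK or DCK construction from the paper. The function names, the toy dimensions, and the element-wise power normalization are all assumptions introduced here; the normalization is a simple stand-in, not the tensor normalization the paper proposes.

```python
import numpy as np


def third_order_descriptor(features: np.ndarray) -> np.ndarray:
    """Average the third-order outer products x (x) x (x) x over all frames.

    features: array of shape (T, d), one d-dimensional vector per frame.
    Returns a symmetric tensor of shape (d, d, d).
    (Hypothetical helper, not taken from the paper.)
    """
    num_frames = features.shape[0]
    # einsum sums x[t, i] * x[t, j] * x[t, k] over the frame axis t.
    summed = np.einsum("ti,tj,tk->ijk", features, features, features)
    return summed / num_frames


def power_normalize(tensor: np.ndarray, gamma: float = 0.5) -> np.ndarray:
    """Element-wise signed power normalization (assumed stand-in)."""
    return np.sign(tensor) * np.abs(tensor) ** gamma


rng = np.random.default_rng(0)
frames = rng.standard_normal((30, 4))  # 30 frames of 4-dimensional toy features
descriptor = power_normalize(third_order_descriptor(frames))
print(descriptor.shape)  # (4, 4, 4)
```

Each entry of the tensor records how three feature coordinates co-occur across the sequence, which is the kind of information a second-order (covariance-style) descriptor cannot express. The size grows as d cubed, so compactness matters when descriptors of this kind are used in practice.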
