Learning Representative Temporal Features for Action Recognition
Ali Javidani, Ahmad Mahmoudi-Aznaveh

TL;DR
This paper introduces a lightweight video classification approach that extracts temporal features using a 1D-CNN on PCA-reduced features from pre-trained CNNs, enabling effective recognition with limited training data.
Contribution
The novel approach combines pre-trained spatial features with a 1D-CNN for temporal classification, reducing training complexity and data requirements.
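The trainable part of this design is only a temporal classifier over a multi-channel time series, with one channel per reduced feature dimension. Below is a minimal NumPy sketch of such a 1D-CNN forward pass under stated assumptions. The layer layout (one convolution, ReLU, global max pooling over time, a linear layer and softmax) and every size (64 channels, 30 time steps, 32 filters of width 5) are hypothetical and are not taken from the paper. The 11 output classes simply match the number of UCF11 categories. Training is omitted.

```python
import numpy as np

def conv1d(x, w, b):
    """Valid 1D convolution (cross-correlation) over time.
    x: (C_in, T), w: (C_out, C_in, K), b: (C_out,) -> (C_out, T - K + 1)."""
    c_out, c_in, k = w.shape
    t_out = x.shape[1] - k + 1
    # Sliding windows over time: shape (C_in, t_out, K)
    windows = np.stack([x[:, i:i + k] for i in range(t_out)], axis=1)
    # For each filter, sum over input channels and kernel taps
    return np.einsum("ctk,ock->ot", windows, w) + b[:, None]

def classify(x, params):
    """Hypothetical temporal classifier: conv -> ReLU -> global max pool -> linear -> softmax."""
    h = np.maximum(conv1d(x, params["w_conv"], params["b_conv"]), 0.0)
    pooled = h.max(axis=1)                      # one value per filter, independent of T
    logits = params["w_fc"] @ pooled + params["b_fc"]
    e = np.exp(logits - logits.max())           # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
n_channels, n_steps, n_filters, kernel, n_classes = 64, 30, 32, 5, 11
params = {
    "w_conv": rng.normal(0, 0.1, (n_filters, n_channels, kernel)),
    "b_conv": np.zeros(n_filters),
    "w_fc": rng.normal(0, 0.1, (n_classes, n_filters)),
    "b_fc": np.zeros(n_classes),
}
video_series = rng.normal(size=(n_channels, n_steps))  # placeholder input series
probs = classify(video_series, params)
# Trainable parameters of this toy classifier only (the pre-trained CNN stays frozen)
n_params = sum(p.size for p in params.values())
print(probs.shape, n_params)
```

Global max pooling is one simple way to let a single classifier accept videos of different lengths. Whether the paper pools this way is not stated here. With these toy sizes the classifier has 64·32·5 + 32 + 32·11 + 11 = 10,635 weights, which illustrates why training only the temporal stage keeps the model light.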
Findings
Achieved state-of-the-art results on the UCF11 and jHMDB datasets.
Performed competitively on the HMDB51 dataset.
Significantly reduced the number of trainable parameters.
Abstract
In this paper, a novel video classification method is presented that aims to recognize different categories of third-person videos efficiently. Our motivation is to achieve a light model that can be trained with limited training data. With this intuition, processing of the 3-dimensional video input is decomposed into 1D processing along the temporal dimension on top of 2D processing in the spatial dimensions. The 2D spatial frames are processed by pre-trained networks with no training phase. The only step that involves training is classifying the 1D time series that results from describing the 2D signals. Concretely, optical flow images are first calculated from consecutive frames and described by pre-trained CNNs. Their dimension is then reduced using PCA. By stacking the description vectors beside each other, a multi-channel time series is created for each video.…
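The PCA and stacking steps described in the abstract can be sketched in NumPy as shown below. This is a minimal illustration that makes several assumptions. The optical-flow computation and the pre-trained CNN are not reproduced, and random vectors stand in for their per-frame descriptors. The descriptor size D = 512, the 64 retained components, the video lengths and fitting PCA on all training frames are all hypothetical choices. `fit_pca` and `video_to_time_series` are illustrative names, not the authors' code.

```python
import numpy as np

def fit_pca(features, n_components):
    """Fit PCA on the rows of `features` (N samples x D dims); return mean and top components."""
    mean = features.mean(axis=0)
    # Right singular vectors of the centered data are the principal axes, sorted by variance
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]                        # (D,), (n_components, D)

def video_to_time_series(frame_descriptors, mean, components):
    """Project each flow-frame descriptor and stack the results over time.
    frame_descriptors: (T, D) -> (n_components, T), i.e. one channel per PCA dimension."""
    reduced = (frame_descriptors - mean) @ components.T   # (T, n_components)
    return reduced.T

rng = np.random.default_rng(0)
D, n_components = 512, 64
# Hypothetical stand-in for pre-trained CNN descriptors of optical-flow images:
# five training videos, each with a different number of flow frames.
train_videos = [rng.normal(size=(t, D)) for t in (30, 42, 25, 38, 50)]
mean, components = fit_pca(np.vstack(train_videos), n_components)
series = [video_to_time_series(v, mean, components) for v in train_videos]
print([s.shape for s in series])
```

Each resulting array has the (channels, time) layout that a 1D-CNN consumes. Only the temporal length varies between videos, while the channel count is fixed by the number of PCA components.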
