Temporal-Spatial Mapping for Action Recognition
Xiaolin Song, Cuiling Lan, Wenjun Zeng, Junliang Xing, Jingyu Yang, and Xiaoyan Sun

TL;DR
This paper introduces Temporal-Spatial Mapping (TSM), a novel method for capturing temporal and spatial dynamics in videos, leading to improved human action recognition performance.
Contribution
The paper proposes a new VideoMap representation and a shallow CNN with temporal attention, advancing video action recognition accuracy.
Findings
Achieves 4.2% higher accuracy than TSN on HMDB51.
Introduces a simple, effective operation for modeling temporal evolution.
Demonstrates state-of-the-art performance on a benchmark dataset.
Abstract
Deep learning models have enjoyed great success for image related computer vision tasks like image classification and object detection. For video related tasks like human action recognition, however, the advancements are not as significant yet. The main challenge is the lack of effective and efficient models in modeling the rich temporal spatial information in a video. We introduce a simple yet effective operation, termed Temporal-Spatial Mapping (TSM), for capturing the temporal evolution of the frames by jointly analyzing all the frames of a video. We propose a video level 2D feature representation by transforming the convolutional features of all frames to a 2D feature map, referred to as VideoMap. With each row being the vectorized feature representation of a frame, the temporal-spatial features are compactly represented, while the temporal dynamic evolution is also well embedded.…
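The row-per-frame layout described in the abstract can be illustrated with a minimal sketch. It assumes per-frame convolutional features are already available as nested lists shaped [C][H][W]. The names `vectorize` and `build_video_map` are hypothetical, and the toy values stand in for real CNN activations. This is not the authors' implementation; it only shows how T frames become a T x D VideoMap.

```python
from typing import List

# One frame's convolutional features, shaped [C][H][W] (assumed layout).
Frame = List[List[List[float]]]


def vectorize(frame: Frame) -> List[float]:
    """Flatten a [C][H][W] feature tensor into a single row vector."""
    return [v for channel in frame for row in channel for v in row]


def build_video_map(frames: List[Frame]) -> List[List[float]]:
    """Stack the vectorized features of T frames into a T x D matrix.

    Row t holds the features of frame t, so the vertical axis is time.
    """
    rows = [vectorize(f) for f in frames]
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all frames must have the same feature size")
    return rows


# Toy input: T=3 frames, each with C=2 channels of 2x2 features.
# Values encode (frame, channel, position) so the layout is easy to inspect.
frames = [
    [[[float(100 * t + 10 * c + 2 * h + w) for w in range(2)]
      for h in range(2)]
     for c in range(2)]
    for t in range(3)
]

video_map = build_video_map(frames)  # 3 rows (frames) x 8 columns (features)
```

Because every row shares the same feature ordering, reading down a column traces one feature's evolution over time, which is what lets a 2D CNN applied to the map see temporal and spatial structure together.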
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Multimodal Machine Learning Applications
