Action Keypoint Network for Efficient Video Recognition
Xu Chen, Yahong Han, Xiaohan Wang, Yifan Sun, Yi Yang

TL;DR
This paper introduces AK-Net, a novel video recognition model that efficiently selects spatial-temporal keypoints and transforms video data into point clouds, improving both accuracy and computational efficiency.
Contribution
The paper proposes integrating spatial and temporal keypoint selection with point cloud classification, addressing limitations of fixed-shape cropping and independent selection methods.
Findings
Improves recognition accuracy on multiple benchmarks.
Reduces computational cost compared to baseline models.
Effectively captures diverse informative regions in videos.
Abstract
Reducing redundancy is crucial for improving the efficiency of video recognition models. An effective approach is to select informative content from the holistic video, yielding a popular family of dynamic video recognition methods. However, existing dynamic methods focus on either temporal or spatial selection independently, neglecting the fact that redundancy is usually spatial and temporal at the same time. Moreover, their selected content is usually cropped with fixed shapes, while the real distribution of informative content can be much more diverse. With these two insights, this paper proposes to integrate temporal and spatial selection into an Action Keypoint Network (AK-Net). From different frames and positions, AK-Net selects some informative points scattered in arbitrary-shaped regions as a set of action keypoints and then transforms the video recognition into…
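The idea of picking points jointly across frames and positions, rather than cropping fixed-shape regions, can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `select_action_keypoints`, the plain top-k rule, and the assumption that a per-position importance score is already available are all hypothetical choices made for illustration. How AK-Net actually scores and selects points, and how it classifies the resulting point set, is not described in this summary.

```python
import numpy as np

def select_action_keypoints(features, scores, k):
    """Pick the k highest-scoring positions across all frames at once.

    Hypothetical helper for illustration only.
    features: array of shape (T, H, W, C), per-position feature vectors.
    scores:   array of shape (T, H, W), assumed importance of each position.
    Returns (coords, point_features): coords has shape (k, 3) holding
    (t, y, x) indices, point_features has shape (k, C).
    """
    T, H, W, C = features.shape
    flat_scores = scores.reshape(-1)

    # Top-k over the joint space-time index, so a frame may contribute
    # many points, few points, or none at all.
    top = np.argpartition(flat_scores, -k)[-k:]
    # Order the selected indices from highest to lowest score.
    top = top[np.argsort(flat_scores[top])[::-1]]

    # Recover (t, y, x) coordinates; together with the features they
    # form an unordered point set, similar to a point cloud.
    t, y, x = np.unravel_index(top, (T, H, W))
    coords = np.stack([t, y, x], axis=1)
    point_features = features[t, y, x]
    return coords, point_features
```

Because the selection runs over the flattened space-time grid, the chosen points can form regions of any shape and can be distributed unevenly over frames, which is the property the abstract contrasts with fixed-shape cropping.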
Taxonomy
Topics: Human Pose and Action Recognition · 3D Shape Modeling and Analysis · 3D Surveying and Cultural Heritage
