Part-level Action Parsing via a Pose-guided Coarse-to-Fine Framework
Xiaodong Chen, Xinchen Liu, Wu Liu, Kun Liu, Dong Wu, Yongdong Zhang, Tao Mei

TL;DR
This paper introduces a pose-guided, coarse-to-fine framework for Part-level Action Parsing that predicts both video-level actions and frame-level body part actions, achieving state-of-the-art results on Kinetics-TPS.
Contribution
It proposes a novel pose-guided positional embedding and segment-level feature recognition for accurate, explainable part-level action parsing in videos.
Findings
Achieves 31.10% ROC score on Kinetics-TPS
Outperforms existing methods significantly
Balances accuracy and computation effectively
Abstract
Action recognition from videos, i.e., classifying a video into one of the pre-defined action types, has been a popular topic in the communities of artificial intelligence, multimedia, and signal processing. However, existing methods usually consider an input video as a whole and learn models, e.g., Convolutional Neural Networks (CNNs), with coarse video-level class labels. These methods can only output an action class for the video, but cannot provide fine-grained and explainable cues to answer why the video shows a specific action. Therefore, researchers have started to focus on a new task, Part-level Action Parsing (PAP), which aims to not only predict the video-level action but also recognize the frame-level fine-grained actions or interactions of body parts for each person in the video. To this end, we propose a coarse-to-fine framework for this challenging task. In particular, our…
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Hand Gesture Recognition Systems
