Vision-Language Meets the Skeleton: Progressively Distillation with Cross-Modal Knowledge for 3D Action Representation Learning
Yang Chen, Tian He, Junfeng Fu, Ling Wang, Jingcai Guo, Ting Hu, Hong Cheng

TL;DR
This paper introduces a novel cross-modal contrastive learning framework that leverages vision-language knowledge prompts and progressive distillation to improve skeleton-based 3D action recognition, achieving state-of-the-art results.
Contribution
The proposed C²VL framework uses vision-language prompts and progressive distillation to learn task-agnostic skeleton action representations without requiring action labels.
Findings
Outperforms previous methods on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets.
Achieves state-of-the-art accuracy in skeleton-based action recognition.
Effectively integrates vision-language knowledge to enrich skeleton representations.
Abstract
Skeleton-based action representation learning aims to interpret and understand human behaviors by encoding skeleton sequences, and can be categorized into two primary training paradigms: supervised learning and self-supervised learning. However, the one-hot classification used by the former requires labor-intensive annotation of predefined action categories, while the latter involves skeleton transformations (e.g., cropping) in the pretext tasks that may impair the skeleton structure. To address these challenges, we introduce a novel skeleton-based training framework (C²VL) based on Cross-modal Contrastive learning that uses progressive distillation to learn task-agnostic human skeleton action representations from Vision-Language knowledge prompts. Specifically, we establish the vision-language action concept space through vision-language knowledge prompts generated by pre-trained…
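The cross-modal contrastive idea in the abstract can be illustrated with a generic symmetric InfoNCE loss that pulls each skeleton embedding toward the embedding of its paired vision-language prompt. This is a minimal sketch and not the authors' objective: the function names, the temperature value, and the assumption that row i of each matrix forms a positive pair are all illustrative, and the progressive distillation step is not modeled here.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def cross_modal_info_nce(skeleton_feats, prompt_feats, temperature=0.07):
    """Symmetric InfoNCE between two modalities (hypothetical helper).

    skeleton_feats, prompt_feats: arrays of shape (batch, dim), where row i
    of each array is assumed to describe the same action sample.
    """
    s = l2_normalize(skeleton_feats)
    p = l2_normalize(prompt_feats)
    logits = s @ p.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(len(s))         # positives lie on the diagonal
    loss_s2p = -log_softmax(logits)[idx, idx].mean()    # skeleton -> prompt
    loss_p2s = -log_softmax(logits.T)[idx, idx].mean()  # prompt -> skeleton
    return 0.5 * (loss_s2p + loss_p2s)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 16))
    # Correctly paired features should give a lower loss than mismatched pairs.
    print(cross_modal_info_nce(feats, feats))
    print(cross_modal_info_nce(feats, np.roll(feats, 1, axis=0)))
```

In a real pipeline the two inputs would come from a skeleton encoder and from embeddings of the vision-language prompts; here random arrays stand in for both.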
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Hand Gesture Recognition Systems
Methods: Contrastive Learning
