Vision-Language Meets the Skeleton: Progressively Distillation with Cross-Modal Knowledge for 3D Action Representation Learning

Yang Chen; Tian He; Junfeng Fu; Ling Wang; Jingcai Guo; Ting Hu; Hong Cheng

arXiv:2405.20606·cs.CV·September 17, 2024

TL;DR

This paper introduces a novel cross-modal contrastive learning framework that leverages vision-language knowledge prompts and progressive distillation to improve skeleton-based 3D action recognition, achieving state-of-the-art results.

Contribution

The proposed C²VL framework uses vision-language prompts and progressive distillation to learn task-agnostic skeleton action representations without requiring action labels.

Findings

1. Outperforms previous methods on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets.
2. Achieves state-of-the-art accuracy in skeleton-based action recognition.
3. Effectively integrates vision-language knowledge to enrich skeleton representations.

Abstract

Skeleton-based action representation learning aims to interpret and understand human behaviors by encoding skeleton sequences. Existing approaches fall into two primary training paradigms: supervised learning and self-supervised learning. However, the former relies on one-hot classification, which requires labor-intensive annotation of predefined action categories, while the latter involves skeleton transformations (e.g., cropping) in its pretext tasks that may impair the skeleton structure. To address these challenges, we introduce a novel skeleton-based training framework (C²VL) based on Cross-modal Contrastive learning that uses progressive distillation to learn task-agnostic human skeleton action representations from Vision-Language knowledge prompts. Specifically, we establish the vision-language action concept space through vision-language knowledge prompts generated by pre-trained…
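
The cross-modal contrastive idea named in the abstract can be illustrated with a generic symmetric InfoNCE loss that pulls each skeleton embedding toward its paired vision-language embedding and pushes it away from the others in the batch. This is only a minimal sketch under stated assumptions, not the authors' C²VL objective: the abstract is truncated before the loss and the progressive distillation schedule are described, so the function names, the temperature value, and the toy embeddings below are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, targets, temperature=0.1):
    """Mean cross-entropy of matching anchors[i] to targets[i] against all targets.

    `temperature` is an assumed hyperparameter, not a value from the paper.
    """
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, a_i in enumerate(a):
        logits = [sum(x * y for x, y in zip(a_i, t_j)) / temperature for t_j in t]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def symmetric_contrastive_loss(skeleton_emb, concept_emb, temperature=0.1):
    """Average the skeleton-to-concept and concept-to-skeleton InfoNCE terms."""
    return 0.5 * (
        info_nce(skeleton_emb, concept_emb, temperature)
        + info_nce(concept_emb, skeleton_emb, temperature)
    )


# Toy 2-D embeddings (hypothetical): row i of each list describes the same action.
skeleton = [[1.0, 0.1], [0.0, 1.0]]
concepts_aligned = [[0.9, 0.0], [0.1, 1.0]]
concepts_swapped = [concepts_aligned[1], concepts_aligned[0]]

aligned = symmetric_contrastive_loss(skeleton, concepts_aligned)
swapped = symmetric_contrastive_loss(skeleton, concepts_swapped)
print(f"aligned pairs: {aligned:.4f}, swapped pairs: {swapped:.4f}")
```

In this toy setup the loss is lower when each skeleton embedding is paired with its matching concept embedding than when the pairs are swapped, which is the behavior a contrastive alignment objective is meant to reward. No action labels enter the computation; only the pairing between the two modalities is used.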

Code & Models

Repositories

cseeyangchen/c2vl
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Hand Gesture Recognition Systems

Methods: Contrastive Learning