SUGAR: Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition
Qilang Ye, Yu Zhou, Lian He, Jie Zhang, Xuanming Guo, Jiayu Zhang, Mingkui Tan, Weicheng Xie, Yue Sun, Tao Tan, Xiaochen Yuan, Ghada Khoriba, Zitong Yu

TL;DR
This paper introduces SUGAR, a novel framework that combines large language models with visual and motion knowledge from video models to improve skeleton-based action recognition and description, including zero-shot scenarios.
Contribution
SUGAR is the first to integrate LLMs with visual-motion priors for skeleton representation learning, enhancing action recognition and description capabilities.
Findings
Effective skeleton representation learning via visual-motion supervision
Improved accuracy on skeleton-based action classification benchmarks
Enhanced zero-shot action recognition performance
Abstract
Large Language Models (LLMs) hold rich implicit knowledge and powerful transferability. In this paper, we explore the combination of LLMs with the human skeleton to perform action classification and description. However, when treating an LLM as a recognizer, two questions arise: 1) How can LLMs understand skeletons? 2) How can LLMs distinguish among actions? To address these problems, we introduce a novel paradigm named learning Skeleton representation with visUal-motion knowledGe for Action Recognition (SUGAR). In our pipeline, we first utilize off-the-shelf large-scale video models as a knowledge base to generate visual and motion information related to actions. Then, we propose to supervise skeleton learning through this prior knowledge to yield discrete representations. Finally, we use the LLM with untouched pre-training weights to understand these representations and generate the desired…
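The middle step of the pipeline (supervising skeleton features with prior knowledge, then producing discrete representations) can be illustrated with a toy sketch. The abstract does not say how the supervision or the discretization is implemented, so everything here is an assumption made for illustration: the cosine-based alignment loss, the nearest-neighbour codebook lookup, the function names `alignment_loss` and `quantize`, the array shapes, and the random arrays standing in for encoder and video-model outputs. It is not the authors' method.

```python
import numpy as np


def alignment_loss(skel, prior):
    """Mean (1 - cosine similarity) between paired rows of two feature arrays.

    Assumed stand-in for "supervising skeleton learning with prior knowledge".
    """
    s = skel / np.linalg.norm(skel, axis=1, keepdims=True)
    p = prior / np.linalg.norm(prior, axis=1, keepdims=True)
    return float(np.mean(1.0 - (s * p).sum(axis=1)))


def quantize(features, codebook):
    """Return, for each feature row, the index of the nearest codebook row.

    Assumed stand-in for "yielding discrete representations".
    """
    # Pairwise squared Euclidean distances, shape (T, K).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


rng = np.random.default_rng(0)
# Hypothetical sizes: 8 time steps, 16-dim features, 32 codebook entries.
skeleton_features = rng.normal(size=(8, 16))  # placeholder skeleton-encoder output
prior_features = rng.normal(size=(8, 16))     # placeholder visual/motion embeddings
codebook = rng.normal(size=(32, 16))          # placeholder learnable codebook

loss = alignment_loss(skeleton_features, prior_features)
tokens = quantize(skeleton_features, codebook)
print("alignment loss:", loss)
print("discrete tokens:", tokens)
```

In a setup like this, the integer tokens are what a frozen LLM would consume, while only the skeleton encoder and codebook would be trained against the alignment loss.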
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Human Motion and Animation
