SUGAR: Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition

Qilang Ye; Yu Zhou; Lian He; Jie Zhang; Xuanming Guo; Jiayu Zhang; Mingkui Tan; Weicheng Xie; Yue Sun; Tao Tan; Xiaochen Yuan; Ghada Khoriba; Zitong Yu

arXiv:2511.10091·cs.CV·November 14, 2025

TL;DR

This paper introduces SUGAR, a novel framework that combines large language models with visual and motion knowledge from video models to improve skeleton-based action recognition and description, including zero-shot scenarios.

Contribution

SUGAR is the first to integrate LLMs with visual-motion priors for skeleton representation learning, enhancing action recognition and description capabilities.

Findings

01. Effective skeleton representation learning via visual-motion supervision

02. Improved accuracy on skeleton-based action classification benchmarks

03. Enhanced zero-shot action recognition performance

Abstract

Large Language Models (LLMs) hold rich implicit knowledge and powerful transferability. In this paper, we explore the combination of LLMs with the human skeleton to perform action classification and description. However, when treating an LLM as a recognizer, two questions arise: 1) How can LLMs understand skeletons? 2) How can LLMs distinguish among actions? To address these problems, we introduce a novel paradigm named learning Skeleton representation with visUal-motion knowledGe for Action Recognition (SUGAR). In our pipeline, we first utilize off-the-shelf large-scale video models as a knowledge base to generate visual and motion information related to actions. Then, we propose to supervise skeleton learning through this prior knowledge to yield discrete representations. Finally, we use the LLM with untouched pre-training weights to understand these representations and generate the desired…
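The abstract does not say how the discrete representations are produced. As a rough illustration only, the sketch below assumes a nearest-neighbor codebook lookup, one common way to turn continuous features into discrete token ids that a language model could consume. The function name `quantize`, the toy codebook, and the toy feature values are all hypothetical and are not taken from the paper.

```python
import math

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    Assumption: Euclidean nearest-neighbor lookup stands in for whatever
    discretization the paper actually uses.
    """
    token_ids = []
    for vec in features:
        # Distance from this feature vector to every codebook entry.
        distances = [math.dist(vec, code) for code in codebook]
        # The index of the closest entry becomes the discrete token id.
        token_ids.append(distances.index(min(distances)))
    return token_ids

# Toy 2-D codebook with three entries (hypothetical values).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
# Toy per-frame skeleton features (hypothetical values).
features = [(0.1, 0.1), (0.9, 0.2), (0.2, 0.8)]

token_ids = quantize(features, codebook)
print(token_ids)
```

In a setup like this, the resulting ids would be the only thing passed to the language model, which is consistent with the abstract's point that the LLM's pre-training weights stay untouched.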

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Human Motion and Animation