LLM Enhanced Action Recognition via Hierarchical Global-Local Skeleton-Language Model

Ruosi Wang; Fangwei Zuo; Lei Li; Zhaoqiang Xia

arXiv:2603.27103 · cs.CV · March 31, 2026

TL;DR

This paper introduces HocSLM, a hierarchical skeleton-language model that enhances action recognition by integrating global-local spatio-temporal modeling with semantically rich textual descriptions, achieving state-of-the-art results on NTU RGB+D 60, NTU RGB+D 120, and Northwestern-UCLA.

Contribution

The paper proposes a novel hierarchical global-local network combined with a large vision-language model and a sequential fusion module for improved semantic understanding in action recognition.

Findings

01

Achieves state-of-the-art performance on NTU RGB+D 60, NTU RGB+D 120, and Northwestern-UCLA datasets.

02

Enhances cross-modal semantic alignment between skeletal data and textual descriptions.

03

Improves modeling of complex spatio-temporal relationships in skeleton-based action recognition.

Abstract

Skeleton-based human action recognition has achieved remarkable progress in recent years. However, most existing GCN-based methods rely on short-range motion topologies, which not only struggle to capture long-range joint dependencies and complex temporal dynamics but also limit cross-modal semantic alignment and understanding due to insufficient modeling of action semantics. To address these challenges, we propose a hierarchical global-local skeleton-language model (HocSLM), enabling the large action model to be more representative of action semantics. First, we design a hierarchical global-local network (HGLNet) that consists of a composite-topology spatial module and a dual-path hierarchical temporal module. By synergistically integrating multi-level global and local modules, HGLNet achieves dynamically collaborative modeling at both global and local scales while preserving prior…
