Bridging the Skeleton-Text Modality Gap: Diffusion-Powered Modality Alignment for Zero-shot Skeleton-based Action Recognition

Jeonghyeok Do; Munchurl Kim

arXiv:2411.10745 · cs.CV · July 17, 2025

TL;DR

This paper introduces a diffusion-based framework called TDSM for zero-shot skeleton-based action recognition, effectively aligning skeleton and text features to improve recognition of unseen actions.

Contribution

The paper proposes a novel diffusion-powered skeleton-text alignment method with a triplet diffusion loss, enhancing zero-shot recognition performance over prior approaches.

Findings

1. TDSM outperforms recent state-of-the-art methods with large accuracy margins.
2. The diffusion-based alignment improves generalization to unseen actions.
3. The triplet diffusion loss enhances discriminative skeleton-text matching.
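
To make the idea of a triplet loss on diffusion outputs concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name `triplet_diffusion_loss`, the `margin` parameter, and the exact combination of terms are hypothetical. The sketch assumes a denoiser has already predicted the noise twice for each noised skeleton feature, once conditioned on the matching action text (`pred_pos`) and once on a mismatched text (`pred_neg`), and it asks the matching prediction to be closer to the true noise by at least `margin`.

```python
import numpy as np


def triplet_diffusion_loss(noise, pred_pos, pred_neg, margin=1.0):
    """Hypothetical triplet-style loss on denoising errors.

    noise:    true noise added to the skeleton features, shape (B, D)
    pred_pos: noise predicted with the matching text prompt, shape (B, D)
    pred_neg: noise predicted with a mismatched text prompt, shape (B, D)
    margin:   assumed hinge margin (not taken from the paper)
    """
    # Per-sample mean squared denoising error under each text condition.
    err_pos = np.mean((pred_pos - noise) ** 2, axis=1)
    err_neg = np.mean((pred_neg - noise) ** 2, axis=1)
    # Hinge term: penalize when the matching text does not beat the
    # mismatched text by at least `margin`.
    hinge = np.maximum(0.0, err_pos - err_neg + margin)
    # Assumed combination: plain denoising error plus the hinge term.
    return float(np.mean(err_pos + hinge))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    eps = rng.standard_normal((4, 8))
    good = eps + 0.01 * rng.standard_normal((4, 8))  # close to the true noise
    bad = eps + 2.0 * rng.standard_normal((4, 8))    # far from the true noise
    print(triplet_diffusion_loss(eps, good, bad))
    print(triplet_diffusion_loss(eps, bad, good))
```

In this sketch the loss is small when the matching text yields the better denoising result and grows when the roles are swapped, which is the discriminative behavior the finding describes.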

Abstract

In zero-shot skeleton-based action recognition (ZSAR), aligning skeleton features with the text features of action labels is essential for accurately predicting unseen actions. ZSAR faces a fundamental challenge in bridging the modality gap between these two kinds of features, which severely limits generalization to unseen actions. Previous methods focus on direct alignment between the skeleton and text latent spaces, but the modality gap between these spaces hinders robust generalization. Motivated by the success of diffusion models in multi-modal alignment (e.g., text-to-image, text-to-video), we present the first diffusion-based skeleton-text alignment framework for ZSAR. Our approach, Triplet Diffusion for Skeleton-Text Matching (TDSM), focuses on the cross-alignment power of diffusion models rather than their generative capability. Specifically, TDSM aligns skeleton features with text…

Code & Models

Repositories

KAIST-VICLab/TDSM (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Natural Language Processing Techniques

Methods: Transformer · Diffusion · Latent Diffusion Model