Bridging the Skeleton-Text Modality Gap: Diffusion-Powered Modality Alignment for Zero-shot Skeleton-based Action Recognition
Jeonghyeok Do, Munchurl Kim

TL;DR
This paper introduces a diffusion-based framework called TDSM for zero-shot skeleton-based action recognition, effectively aligning skeleton and text features to improve recognition of unseen actions.
Contribution
The paper proposes a novel diffusion-powered skeleton-text alignment method with a triplet diffusion loss, enhancing zero-shot recognition performance over prior approaches.
Findings
TDSM outperforms recent state-of-the-art methods by large accuracy margins.
The diffusion-based alignment improves generalization to unseen actions.
The triplet diffusion loss enhances discriminative skeleton-text matching.
Abstract
In zero-shot skeleton-based action recognition (ZSAR), aligning skeleton features with the text features of action labels is essential for accurately predicting unseen actions. ZSAR faces a fundamental challenge in bridging the modality gap between these two kinds of features, which severely limits generalization to unseen actions. Previous methods focus on direct alignment between the skeleton and text latent spaces, but the modality gap between these spaces hinders robust generalization. Motivated by the success of diffusion models in multi-modal alignment (e.g., text-to-image, text-to-video), we present the first diffusion-based skeleton-text alignment framework for ZSAR. Our approach, Triplet Diffusion for Skeleton-Text Matching (TDSM), focuses on the cross-alignment power of diffusion models rather than their generative capability. Specifically, TDSM aligns skeleton features with text…
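To make the idea of a triplet loss on a diffusion objective more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract excerpt above is truncated and does not give the formulation, so the function names (`mse`, `triplet_diffusion_loss`, `predict_label`), the hinge form, and the `margin` and `weight` parameters are all assumptions made for illustration. The sketch assumes a denoiser that predicts the noise added to a skeleton latent, conditioned on a text feature, and asks that the prediction be more accurate under the matching text than under a non-matching one.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def triplet_diffusion_loss(noise, pred_pos, pred_neg, margin=1.0, weight=1.0):
    """Hypothetical triplet-style loss on denoising errors (illustrative only).

    noise    -- the noise that was actually added to the skeleton latent
    pred_pos -- noise predicted when conditioned on the matching text
    pred_neg -- noise predicted when conditioned on a non-matching text
    """
    err_pos = mse(pred_pos, noise)  # ordinary denoising error (anchor vs. positive)
    err_neg = mse(pred_neg, noise)  # denoising error under the wrong text
    # Hinge term: penalize unless the wrong text is worse by at least `margin`.
    hinge = max(err_pos - err_neg + margin, 0.0)
    return err_pos + weight * hinge


def predict_label(noise, preds_by_label):
    """Assumed inference rule: choose the label whose text conditioning
    yields the smallest denoising error."""
    return min(preds_by_label, key=lambda label: mse(preds_by_label[label], noise))


# Toy values, not real model outputs.
noise = [0.5, -1.0, 0.25]
good_pred = [0.5, -1.0, 0.25]  # denoiser conditioned on the correct label text
bad_pred = [0.0, 0.0, 0.0]     # denoiser conditioned on a wrong label text

well_aligned = triplet_diffusion_loss(noise, good_pred, bad_pred, margin=0.2)
misaligned = triplet_diffusion_loss(noise, bad_pred, good_pred, margin=0.2)
label = predict_label(noise, {"wave": good_pred, "kick": bad_pred})
```

In this toy setup the loss is zero when the matching text already denoises better than the non-matching one by more than the margin, and positive otherwise. In a real system the predictions would come from a trained, text-conditioned denoising network rather than hand-written vectors.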
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Natural Language Processing Techniques
Methods: Transformer · Diffusion · Latent Diffusion Model
