Frequency-Semantic Enhanced Variational Autoencoder for Zero-Shot Skeleton-based Action Recognition

Wenhan Wu; Zhishuai Guo; Chen Chen; Hongfei Xue; Aidong Lu

arXiv:2506.22179·cs.CV·June 30, 2025

TL;DR

This paper introduces FS-VAE, a novel model that enhances zero-shot skeleton-based action recognition by integrating frequency decomposition and multi-level semantic alignment to better capture fine-grained action details.

Contribution

The paper proposes a Frequency-Semantic Enhanced Variational Autoencoder (FS-VAE) with frequency-based modules and calibrated cross-alignment loss for improved zero-shot action recognition.

Findings

01. Enhanced semantic features improve action differentiation.

02. Frequency decomposition boosts robustness in recognition.

03. Effective alignment reduces semantic ambiguity.

Abstract

Zero-shot skeleton-based action recognition aims to develop models capable of identifying actions beyond the categories encountered during training. Previous approaches have primarily focused on aligning visual and semantic representations but often overlooked the importance of fine-grained action patterns in the semantic space (e.g., the hand movements in drinking water and brushing teeth). To address these limitations, we propose a Frequency-Semantic Enhanced Variational Autoencoder (FS-VAE) to explore the skeleton semantic representation learning with frequency decomposition. FS-VAE consists of three key components: 1) a frequency-based enhancement module with high- and low-frequency adjustments to enrich the skeletal semantics learning and improve the robustness of zero-shot action recognition; 2) a semantic-based action description with multilevel alignment to capture both local…
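To make the idea of high- and low-frequency adjustment concrete, here is a minimal sketch on a single joint-coordinate trajectory. It is not the authors' module: the abstract does not say which transform FS-VAE uses, so the discrete cosine transform (DCT) along the time axis is an assumption, and the names `frequency_adjust`, `cutoff`, `low_gain`, and `high_gain` are hypothetical, chosen only for illustration.

```python
import math


def dct(signal):
    """Orthonormal DCT-II of a 1-D sequence (assumed transform, see lead-in)."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        total = sum(
            x * math.cos(math.pi * (2 * t + 1) * k / (2 * n))
            for t, x in enumerate(signal)
        )
        coeffs.append(scale * total)
    return coeffs


def idct(coeffs):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(coeffs)
    signal = []
    for t in range(n):
        total = sum(
            math.sqrt((1.0 if k == 0 else 2.0) / n)
            * c
            * math.cos(math.pi * (2 * t + 1) * k / (2 * n))
            for k, c in enumerate(coeffs)
        )
        signal.append(total)
    return signal


def frequency_adjust(trajectory, cutoff, low_gain=1.0, high_gain=1.0):
    """Scale the low and high temporal-frequency bands of one coordinate.

    trajectory: values of one joint coordinate over time.
    cutoff:     hypothetical index separating low from high frequencies.
    low_gain / high_gain: hypothetical scaling factors for each band.
    """
    coeffs = dct(trajectory)
    scaled = [
        c * (low_gain if k < cutoff else high_gain)
        for k, c in enumerate(coeffs)
    ]
    return idct(scaled)


# Example: damp the high-frequency band of a short wrist trajectory
# while leaving the low-frequency band untouched.
wrist_x = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
smoothed = frequency_adjust(wrist_x, cutoff=3, high_gain=0.2)
```

With both gains left at 1.0 the function returns the input trajectory up to floating-point error, so any change in the output comes only from the chosen band scaling. In a learned model the gains would presumably be parameters or per-band weights rather than fixed constants.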

Taxonomy

Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Domain Adaptation and Few-Shot Learning