SCALE: Semantic- and Confidence-Aware Conditional Variational Autoencoder for Zero-shot Skeleton-based Action Recognition

Soroush Oraki, Feng Ding, and Jie Liang

arXiv:2604.02222·cs.CV·April 3, 2026

TL;DR

SCALE is an energy-based framework built on a text-conditioned conditional VAE for zero-shot skeleton-based action recognition. It uses text semantics and confidence measures to separate competing classes, and it scores unseen classes by likelihood without generating samples at test time.

Contribution

The paper proposes SCALE, a deterministic, text-conditioned VAE with a listwise energy loss and latent prototypes, which it reports improves zero-shot skeleton-based action recognition over prior VAE- and alignment-based methods.

Findings

01

SCALE outperforms previous VAE and alignment-based methods on NTU datasets.

02

The approach effectively separates semantically similar classes without sample generation.

03

Incorporating posterior uncertainty improves decision margins and the handling of ambiguous instances.

Abstract

Zero-shot skeleton-based action recognition (ZSAR) aims to recognize action classes without any training skeletons from those classes, relying instead on auxiliary semantics from text. Existing approaches frequently depend on explicit skeleton-text alignment, which can be brittle when action names underspecify fine-grained dynamics and when unseen classes are semantically confusable. We propose SCALE, a lightweight and deterministic Semantic- and Confidence-Aware Listwise Energy-based framework that formulates ZSAR as class-conditional energy ranking. SCALE builds a text-conditioned Conditional Variational Autoencoder where frozen text representations parameterize both the latent prior and the decoder, enabling likelihood-based evaluation for unseen classes without generating samples at test time. To separate competing hypotheses, we introduce a semantic- and confidence-aware listwise…
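
To make the idea of class-conditional energy ranking concrete, here is a minimal sketch of test-time scoring only. It is not the authors' implementation. It assumes diagonal Gaussian posteriors and priors, a squared-error reconstruction term, and an energy equal to the negative ELBO evaluated at the posterior mean (so nothing is sampled). The function names, the two toy classes, the hand-written priors, and the toy decoders are all hypothetical stand-ins for networks that would be conditioned on frozen text embeddings. The listwise training loss is not sketched.

```python
import math


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))))."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def class_energy(x, mu_q, logvar_q, prior, decode):
    """Assumed energy: reconstruction error at the posterior mean plus KL to the class prior."""
    mu_p, logvar_p = prior
    x_hat = decode(mu_q)  # deterministic: decode the mean, no sampling
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)


def rank_classes(x, mu_q, logvar_q, priors, decoders):
    """Score every candidate class and return labels sorted by ascending energy."""
    energies = {
        label: class_energy(x, mu_q, logvar_q, priors[label], decoders[label])
        for label in priors
    }
    return sorted(energies, key=energies.get), energies


# Hypothetical toy setup: in a real model the priors and decoders would be
# produced from frozen text embeddings of each unseen class name.
priors = {
    "wave": ([1.0, 0.0], [0.0, 0.0]),   # (prior mean, prior log-variance)
    "clap": ([-1.0, 0.0], [0.0, 0.0]),
}
decoders = {
    "wave": lambda z: [z[0] + 0.5, z[1]],
    "clap": lambda z: [z[0] - 0.5, z[1]],
}

x = [1.4, 0.1]            # toy skeleton feature
mu_q = [0.9, 0.1]         # toy posterior mean from an encoder
logvar_q = [-2.0, -2.0]   # toy posterior log-variance

ranking, energies = rank_classes(x, mu_q, logvar_q, priors, decoders)
print(ranking, energies)
```

In this toy setup the "wave" hypothesis receives the lower energy, because its prior sits close to the posterior mean and its decoder reconstructs the input exactly. The predicted class is simply the first entry of the ranking.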
