Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes

Yehna Kim; Young-Eun Kim; Seong-Whan Lee

arXiv:2510.27255·cs.CV·November 4, 2025

Open Access

TL;DR

This paper introduces a novel approach for zero-shot action recognition that uses web-crawled descriptions and a spatio-temporal module to improve semantic understanding and reduce manual annotation effort, achieving state-of-the-art results.

Contribution

The paper proposes using a large language model to extract description attributes from web-crawled text, together with a spatio-temporal interaction module that improves video understanding in zero-shot settings.

Findings

01. Achieved 81.0% accuracy on UCF-101
02. Achieved 53.1% accuracy on HMDB-51
03. Achieved 68.9% accuracy on Kinetics-600

Abstract

Vision-Language Models (VLMs) have demonstrated impressive capabilities in zero-shot action recognition by learning to associate video embeddings with class embeddings. However, a significant challenge arises when relying solely on action classes to provide semantic context, particularly due to the presence of multi-semantic words, which can introduce ambiguity in understanding the intended concepts of actions. To address this issue, we propose an innovative approach that harnesses web-crawled descriptions, leveraging a large-language model to extract relevant keywords. This method reduces the need for human annotators and eliminates the laborious manual process of attribute data creation. Additionally, we introduce a spatio-temporal interaction module designed to focus on objects and action units, facilitating alignment between description attributes and video content. In our zero-shot…
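
The matching step the abstract describes, associating a video embedding with class embeddings, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 3-dimensional vectors, the function names (`class_embedding`, `zero_shot_predict`), and the weighted-average fusion of class-name and attribute embeddings are all assumptions made for illustration. It only shows how extra description attributes could pull an ambiguous class name toward the intended meaning.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def class_embedding(name_vec, attribute_vecs, weight=0.5):
    """Blend a class-name embedding with the mean of its attribute embeddings.

    The weighted average is a hypothetical fusion rule, not the paper's.
    """
    name_vec = normalize(name_vec)
    if not attribute_vecs:
        return name_vec
    normed = [normalize(a) for a in attribute_vecs]
    mean_attr = [sum(a[i] for a in normed) / len(normed)
                 for i in range(len(name_vec))]
    blended = [(1 - weight) * n + weight * m
               for n, m in zip(name_vec, mean_attr)]
    return normalize(blended)

def zero_shot_predict(video_vec, class_vecs):
    """Return the label whose embedding is most similar to the video."""
    scores = {label: cosine(video_vec, vec)
              for label, vec in class_vecs.items()}
    return max(scores, key=scores.get), scores

# Toy embeddings (made up): the name "bowling" is ambiguous between two senses,
# and its attribute embedding points at the sense shown in the video.
video = [1.0, 0.0, 0.0]
bowling_name = [1.0, 1.0, 0.0]
classes = {
    "bowling": class_embedding(bowling_name, [[1.0, 0.0, 0.0]]),
    "cricket shot": class_embedding([0.0, 1.0, 0.0], [[0.0, 1.0, 0.0]]),
}
label, scores = zero_shot_predict(video, classes)
name_only_score = cosine(video, bowling_name)
```

In this toy setup, the attribute-augmented "bowling" embedding scores higher against the video than the class name alone, which is the kind of disambiguation the abstract attributes to description attributes.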

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning