Building a Multi-modal Spatiotemporal Expert for Zero-shot Action Recognition with CLIP

Yating Yu; Congqi Cao; Yueran Zhang; Qinyi Lv; Lingtong Min; Yanning Zhang

arXiv:2412.09895·cs.CV·February 11, 2025·2 cites

TL;DR

This paper introduces Spatiotemporal Dynamic Duo (STDD), a CLIP-based framework for zero-shot action recognition. It captures multi-modal spatiotemporal dynamics through a parameter-free Space-time Cross Attention on the vision side and spatiotemporal text augmentation on the semantic side, and it outperforms existing methods on Kinetics-600, UCF101, and HMDB51.

Contribution

The work proposes a novel spatiotemporal dynamic framework with space-time cross attention and an action semantic knowledge graph for improved zero-shot action recognition.

Findings

01. Outperforms state-of-the-art methods on Kinetics-600, UCF101, and HMDB51

02. Captures spatiotemporal dynamics without adding parameters or increasing computational complexity

03. Improves generalization through aligned video and text representations

Abstract

Zero-shot action recognition (ZSAR) requires collaborative multi-modal spatiotemporal understanding. However, finetuning CLIP directly for ZSAR yields suboptimal performance, given its inherent constraints in capturing essential temporal dynamics from both vision and text perspectives, especially when encountering novel actions with fine-grained spatiotemporal discrepancies. In this work, we propose Spatiotemporal Dynamic Duo (STDD), a novel CLIP-based framework to comprehend multi-modal spatiotemporal dynamics synergistically. For the vision side, we propose an efficient Space-time Cross Attention, which captures spatiotemporal dynamics flexibly with simple yet effective operations applied before and after spatial attention, without adding additional parameters or increasing computational complexity. For the semantic side, we conduct spatiotemporal text augmentation by comprehensively…
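To make the abstract's starting point concrete, the following is a minimal sketch of how a CLIP-style model can score a video against unseen action names. It is not the STDD method: it is the plain baseline of averaging per-frame embeddings and comparing them with class-text embeddings. The function names, the temperature value, and the random arrays standing in for encoder outputs are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def zero_shot_scores(frame_embeds, text_embeds, temperature=0.01):
    """Hypothetical helper: class probabilities for one video.

    frame_embeds: (T, D) per-frame image embeddings (assumed encoder output).
    text_embeds:  (C, D) one text embedding per candidate action name.
    temperature:  assumed softmax temperature, not taken from the paper.
    """
    # Temporal mean pooling: collapses T frames into a single video vector.
    video = l2_normalize(frame_embeds.mean(axis=0))
    text = l2_normalize(text_embeds)
    logits = text @ video / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


# Stand-in data: random vectors instead of real CLIP features.
rng = np.random.default_rng(0)
text_embeds = rng.normal(size=(3, 8))                             # 3 candidate actions
frame_embeds = text_embeds[1] + 0.1 * rng.normal(size=(4, 8))     # 4 frames near class 1

probs = zero_shot_scores(frame_embeds, text_embeds)
predicted = int(np.argmax(probs))
shuffled_probs = zero_shot_scores(frame_embeds[::-1], text_embeds)
```

Because the video vector here is a mean over frames, reversing the frame order leaves the scores unchanged. This illustrates why a baseline of this kind cannot separate actions that differ only in temporal order, which is the kind of limitation the abstract attributes to using CLIP directly.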

Code & Models

Repositories

mia-yatingyu/stdd (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications

Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training