Building a Multi-modal Spatiotemporal Expert for Zero-shot Action Recognition with CLIP
Yating Yu, Congqi Cao, Yueran Zhang, Qinyi Lv, Lingtong Min, Yanning Zhang

TL;DR
This paper introduces Spatiotemporal Dynamic Duo (STDD), a CLIP-based framework for zero-shot action recognition. It captures multi-modal spatiotemporal dynamics with a parameter-free Space-time Cross Attention on the vision side and spatiotemporal text augmentation on the semantic side, and it reports state-of-the-art results on Kinetics-600, UCF101, and HMDB51.
Contribution
The work proposes a novel spatiotemporal dynamic framework with space-time cross attention and an action semantic knowledge graph for improved zero-shot action recognition.
Findings
Outperforms state-of-the-art methods on Kinetics-600, UCF101, and HMDB51
Captures spatiotemporal dynamics without extra parameters or added computational complexity
Enhances generalization through aligned video and text representations
Abstract
Zero-shot action recognition (ZSAR) requires collaborative multi-modal spatiotemporal understanding. However, finetuning CLIP directly for ZSAR yields suboptimal performance, given its inherent constraints in capturing essential temporal dynamics from both vision and text perspectives, especially when encountering novel actions with fine-grained spatiotemporal discrepancies. In this work, we propose Spatiotemporal Dynamic Duo (STDD), a novel CLIP-based framework to comprehend multi-modal spatiotemporal dynamics synergistically. For the vision side, we propose an efficient Space-time Cross Attention, which captures spatiotemporal dynamics flexibly with simple yet effective operations applied before and after spatial attention, without adding additional parameters or increasing computational complexity. For the semantic side, we conduct spatiotemporal text augmentation by comprehensively…
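For readers unfamiliar with the CLIP-style zero-shot setup that the abstract builds on, the following is a minimal sketch of how an unseen action can be scored against class-name text embeddings. It is not the STDD method: the function names, the mean pooling over frames, the temperature value, and the toy feature vectors are all assumptions made for illustration. In practice the features would come from CLIP's image and text encoders.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length; fall back to 1.0 for an all-zero vector.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def mean_pool(frames):
    # Average per-frame features into one video-level feature (an assumed, simple pooling).
    t = len(frames)
    return [sum(col) / t for col in zip(*frames)]

def zero_shot_action_probs(frame_feats, text_feats, temperature=0.01):
    # Cosine similarity between the pooled video feature and each class text
    # feature, divided by an assumed temperature, followed by a softmax.
    video = l2_normalize(mean_pool(frame_feats))
    logits = [
        sum(a * b for a, b in zip(l2_normalize(t), video)) / temperature
        for t in text_feats
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy features: 3 action classes, 3 frames, 3-dimensional embeddings.
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
frame_feats = [[0.1, 0.9, 0.0], [0.0, 1.1, 0.1], [0.2, 1.0, 0.0]]

probs = zero_shot_action_probs(frame_feats, text_feats)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Because the class set enters only through the text features, new action names can be added at inference time without retraining. Mean pooling treats frames as an unordered set, which illustrates the kind of temporal limitation the abstract attributes to directly finetuned CLIP.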
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications
Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training
