Motion-Guided Semantic Alignment with Negative Prompts for Zero-Shot Video Action Recognition
Yiming Wang, Frederick W. B. Li, Jingyun Wang

TL;DR
This paper introduces a novel zero-shot video action recognition framework that leverages motion separation and semantic alignment with positive and negative prompts to improve performance over prior CLIP-based methods.
Contribution
It proposes a motion-guided semantic alignment approach with disentangled embeddings and negative prompts, enhancing zero-shot recognition accuracy.
Findings
Outperforms prior CLIP-based methods on standard benchmarks.
Achieves robust zero-shot recognition across coarse-grained and fine-grained datasets.
Effectively models non-class semantics using negative prompts.
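As a rough illustration of the negative-prompt idea in the findings above, the sketch below scores a video embedding against one positive prompt embedding per class and subtracts a penalty for its closest negative ("non-class") prompt. This summary does not state the paper's actual scoring rule, so the function names, the `neg_weight` parameter, and the max-over-negatives penalty are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def score_classes(video_emb, pos_prompts, neg_prompts, neg_weight=0.5):
    """Score each class: similarity to its positive prompt minus a
    weighted penalty for the most similar negative prompt.

    pos_prompts: {class_name: embedding}
    neg_prompts: {class_name: [embedding, ...]} (may omit a class)
    neg_weight:  hypothetical penalty strength, not taken from the paper
    """
    scores = {}
    for name, pos_emb in pos_prompts.items():
        score = cosine(video_emb, pos_emb)
        negatives = neg_prompts.get(name, [])
        if negatives:
            score -= neg_weight * max(cosine(video_emb, n) for n in negatives)
        scores[name] = score
    return scores


def predict(video_emb, pos_prompts, neg_prompts, neg_weight=0.5):
    """Return the class name with the highest penalized score."""
    scores = score_classes(video_emb, pos_prompts, neg_prompts, neg_weight)
    return max(scores, key=scores.get)


# Toy 2-D embeddings standing in for CLIP video/text features.
video = [1.0, 0.0]
positives = {"run": [1.0, 0.0], "jump": [0.0, 1.0]}
negatives = {"run": [[0.0, 1.0]], "jump": [[1.0, 0.0]]}
print(predict(video, positives, negatives))
```

In practice the embeddings would come from a CLIP-style video and text encoder; the toy vectors here only show how a negative prompt can push down a class whose "non-class" description matches the video.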
Abstract
Zero-shot action recognition is challenging due to the semantic gap between seen and unseen classes. We present a novel framework that enhances CLIP with disentangled embeddings and semantic-guided interaction. A Motion Separation Module (MSM) separates motion-sensitive and global-static features, while a Motion Aggregation Block (MAB) employs gated cross-attention to refine motion representation without re-coupling redundant information. To facilitate generalization to unseen categories, we enforce semantic alignment between video features and textual representations by aligning projected embeddings with positive textual prompts, while leveraging negative prompts to explicitly model "non-class" semantics. Experiments on standard benchmarks demonstrate that our method consistently outperforms prior CLIP-based approaches, achieving robust zero-shot action recognition across both coarse…
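The gated cross-attention mentioned in the abstract can be pictured with a minimal sketch: motion tokens act as queries over static tokens, and a scalar gate controls how much of the attended context is added back to the motion stream. The abstract does not specify the Motion Aggregation Block's layout, so everything below (no learned projections, a single sigmoid gate, the name `gate_logit`) is an assumption for illustration, not the authors' implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gated_cross_attention(motion, static, gate_logit=0.0):
    """Refine motion tokens with context attended from static tokens.

    motion, static: lists of equal-dimension token vectors.
    gate_logit: hypothetical scalar; sigmoid(gate_logit) scales how much
    static context is mixed in. A gate near 0 leaves motion tokens
    almost unchanged, limiting re-coupling of the two streams.
    """
    dim = len(motion[0])
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    refined = []
    for query in motion:
        # Scaled dot-product attention weights over the static tokens.
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in static])
        context = [
            sum(w * value[i] for w, value in zip(weights, static))
            for i in range(dim)
        ]
        # Gated residual update of the motion token.
        refined.append([q + gate * c for q, c in zip(query, context)])
    return refined


motion_tokens = [[0.0, 0.0]]
static_tokens = [[1.0, 2.0], [1.0, 2.0]]
print(gated_cross_attention(motion_tokens, static_tokens, gate_logit=0.0))
```

A real block would add learned query, key, and value projections and likely a learned or input-dependent gate; the sketch keeps only the attention-plus-gate structure.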
