CLIP-guided Prototype Modulating for Few-shot Action Recognition
Xiang Wang, Shiwei Zhang, Jun Cen, Changxin Gao, Yingya Zhang, Deli Zhao, Nong Sang

TL;DR
This paper introduces CLIP-FSAR, a framework that leverages CLIP's multimodal knowledge to improve few-shot action recognition by refining visual prototypes with semantic priors and contrastive learning.
Contribution
It proposes a novel CLIP-guided prototype modulation method combining contrastive learning and prototype refinement for low-data action recognition tasks.
Findings
Significantly outperforms existing methods on five benchmarks.
Effectively utilizes CLIP's semantic priors for prototype refinement.
Achieves robust few-shot classification with limited data.
Abstract
Learning from large-scale contrastive language-image pre-training like CLIP has shown remarkable success in a wide range of downstream tasks recently, but it is still under-explored on the challenging few-shot action recognition (FSAR) task. In this work, we aim to transfer the powerful multimodal knowledge of CLIP to alleviate the inaccurate prototype estimation issue due to data scarcity, which is a critical problem in low-shot regimes. To this end, we present a CLIP-guided prototype modulating framework called CLIP-FSAR, which consists of two key components: a video-text contrastive objective and a prototype modulation. Specifically, the former bridges the task discrepancy between CLIP and the few-shot video task by contrasting videos and corresponding class text descriptions. The latter leverages the transferable textual concepts from CLIP to adaptively refine visual prototypes with…
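The two components named in the abstract can be illustrated with a minimal NumPy sketch. It assumes video and class-text embeddings have already been produced by some encoder (random toy vectors stand in for CLIP features here). The contrastive loss is a standard softmax cross-entropy over cosine similarities. The abstract is cut off before it says how prototypes are refined, so `modulate_prototypes` uses a simple weighted blend of the visual class mean and the text embedding purely as a placeholder assumption, not the paper's mechanism. All function names, `temperature`, and `alpha` are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def video_text_contrastive_loss(video_feats, text_feats, labels, temperature=0.07):
    """Cross-entropy over video-to-class-text cosine similarities.

    video_feats: (num_videos, dim) pooled video embeddings
    text_feats:  (num_classes, dim) class-description embeddings
    labels:      (num_videos,) integer class index of each video
    """
    logits = l2_normalize(video_feats) @ l2_normalize(text_feats).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def modulate_prototypes(support_feats, support_labels, text_feats, alpha=0.5):
    """ASSUMPTION: blend each class's mean support feature with its text embedding.

    This convex combination is a stand-in for the refinement step, which the
    truncated abstract does not specify.
    """
    num_classes = text_feats.shape[0]
    visual_protos = np.stack(
        [support_feats[support_labels == c].mean(axis=0) for c in range(num_classes)]
    )
    return (1.0 - alpha) * l2_normalize(visual_protos) + alpha * l2_normalize(text_feats)


def classify_queries(query_feats, prototypes):
    """Assign each query video to the prototype with the highest cosine similarity."""
    sims = l2_normalize(query_feats) @ l2_normalize(prototypes).T
    return sims.argmax(axis=1)


if __name__ == "__main__":
    # Toy 5-way 1-shot episode with synthetic 8-dimensional features.
    rng = np.random.default_rng(0)
    class_centers = np.eye(5, 8)          # stand-in for class-text embeddings
    episode_labels = np.arange(5)
    support = class_centers + 0.05 * rng.standard_normal((5, 8))
    queries = class_centers + 0.05 * rng.standard_normal((5, 8))

    prototypes = modulate_prototypes(support, episode_labels, class_centers)
    print("contrastive loss:", video_text_contrastive_loss(support, class_centers, episode_labels))
    print("predicted classes:", classify_queries(queries, prototypes))
```

Setting `alpha=0` in this sketch recovers plain mean-of-support prototypes, which makes it easy to compare a purely visual prototype against one that also uses the class text embedding.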
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Residual Connection · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Adam · Dropout
