CLIP-guided Prototype Modulating for Few-shot Action Recognition

Xiang Wang; Shiwei Zhang; Jun Cen; Changxin Gao; Yingya Zhang; Deli Zhao; Nong Sang

arXiv:2303.02982 · cs.CV · October 28, 2024 · 1 citation

TL;DR

This paper introduces CLIP-FSAR, a framework that leverages CLIP's multimodal knowledge to improve few-shot action recognition by refining visual prototypes with semantic priors and contrastive learning.

Contribution

It proposes a novel CLIP-guided prototype modulation method combining contrastive learning and prototype refinement for low-data action recognition tasks.

Findings

01  Significantly outperforms existing methods on five benchmarks.
02  Effectively utilizes CLIP's semantic priors for prototype refinement.
03  Achieves robust few-shot classification with limited data.

Abstract

Learning from large-scale contrastive language-image pre-training like CLIP has shown remarkable success in a wide range of downstream tasks recently, but it is still under-explored on the challenging few-shot action recognition (FSAR) task. In this work, we aim to transfer the powerful multimodal knowledge of CLIP to alleviate the inaccurate prototype estimation issue due to data scarcity, which is a critical problem in low-shot regimes. To this end, we present a CLIP-guided prototype modulating framework called CLIP-FSAR, which consists of two key components: a video-text contrastive objective and a prototype modulation. Specifically, the former bridges the task discrepancy between CLIP and the few-shot video task by contrasting videos and corresponding class text descriptions. The latter leverages the transferable textual concepts from CLIP to adaptively refine visual prototypes with…
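
The two components named in the abstract can be illustrated with a toy sketch in plain Python. This is not the CLIP-FSAR implementation. The function names, the temperature of 0.07, the blend weight `alpha`, and the toy feature vectors are all assumptions made for illustration. The abstract is cut off before it says how prototypes are refined, so the convex blend of visual mean and text feature used here is only a stand-in for whatever fusion the paper actually uses.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def video_text_contrastive_loss(video_feats, labels, text_feats, temperature=0.07):
    """Mean cross-entropy of softmax(cos(video, text_c) / temperature).

    Each video is pulled toward the text feature of its own class and
    pushed away from the text features of the other classes.
    """
    total = 0.0
    for feat, label in zip(video_feats, labels):
        logits = [cosine(feat, t) / temperature for t in text_feats]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[label]
    return total / len(video_feats)

def modulated_prototypes(support_feats, support_labels, text_feats, alpha=0.5):
    """Blend each class's mean support feature with its text feature.

    ASSUMPTION: a fixed convex combination stands in for the paper's
    adaptive refinement. Every class needs at least one support video.
    """
    protos = []
    for c, text in enumerate(text_feats):
        members = [f for f, y in zip(support_feats, support_labels) if y == c]
        mean = [sum(col) / len(members) for col in zip(*members)]
        mean, text_n = l2_normalize(mean), l2_normalize(text)
        protos.append([(1 - alpha) * m + alpha * t for m, t in zip(mean, text_n)])
    return protos

def classify(query, protos):
    """Return the index of the prototype most similar to the query."""
    sims = [cosine(query, p) for p in protos]
    return max(range(len(protos)), key=sims.__getitem__)

# Hypothetical 2-way 1-shot episode with 3-dimensional features.
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]      # one text feature per class
support_feats = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3]]   # one support video per class
support_labels = [0, 1]

loss = video_text_contrastive_loss(support_feats, support_labels, text_feats)
protos = modulated_prototypes(support_feats, support_labels, text_feats)
prediction = classify([0.7, 0.3, 0.0], protos)
```

Setting `alpha=0` in this sketch falls back to purely visual prototypes, which makes it easy to compare against the text-blended version on the same episode.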

Code & Models

Repositories

alibaba-mmai-research/clip-fsar (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Residual Connection · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Adam · Dropout