Few-shot Action Recognition with Captioning Foundation Models
Xiang Wang, Shiwei Zhang, Hangjie Yuan, Yingya Zhang, Changxin Gao, Deli Zhao, Nong Sang

TL;DR
CapFSAR leverages captioning foundation models to automatically generate textual descriptions from videos, enabling effective multimodal few-shot action recognition without manual annotation, and achieves state-of-the-art results.
Contribution
Introduces CapFSAR, a plug-and-play framework that exploits multimodal foundation models for few-shot action recognition without manual text annotation.
Findings
Outperforms existing methods on multiple benchmarks, achieving state-of-the-art performance in few-shot settings.
Effectively utilizes automatically generated captions for recognition.
Abstract
Transferring vision-language knowledge from pretrained multimodal foundation models to various downstream tasks is a promising direction. However, most current few-shot action recognition methods are still limited to a single visual modality input due to the high cost of annotating additional textual descriptions. In this paper, we develop an effective plug-and-play framework called CapFSAR to exploit the knowledge of multimodal models without manually annotating text. To be specific, we first utilize a captioning foundation model (i.e., BLIP) to extract visual features and automatically generate associated captions for input videos. Then, we apply a text encoder to the synthetic captions to obtain representative text embeddings. Finally, a visual-text aggregation module based on Transformer is further designed to incorporate cross-modal spatio-temporal complementary information for…
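The aggregation step described in the abstract can be illustrated with a minimal sketch. It assumes a single attention head, NumPy only, and random arrays standing in for the BLIP frame features and the caption embedding. The function name fuse_visual_text, the projection matrices, and the choice to drop the text token after fusion are all hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_visual_text(frame_feats, text_emb, w_q, w_k, w_v):
    """Single-head self-attention over [caption token; frame tokens].

    frame_feats: (T, D) per-frame visual features (stand-in for BLIP output).
    text_emb:    (D,)   caption embedding from a text encoder.
    w_q/w_k/w_v: (D, D) projection matrices (hypothetical parameters).
    Returns (T, D): frame tokens after attending to the caption and to
    each other, with a residual connection.
    """
    # Prepend the caption embedding as one extra token.
    tokens = np.concatenate([text_emb[None, :], frame_feats], axis=0)  # (T+1, D)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Scaled dot-product attention across all tokens.
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)  # (T+1, T+1)
    fused = tokens + attn @ v  # residual connection
    # Assumption: keep only the frame tokens for downstream matching.
    return fused[1:]


# Toy usage with random features in place of real model outputs.
rng = np.random.default_rng(0)
T, D = 8, 16
frames = rng.normal(size=(T, D))
caption = rng.normal(size=D)
w_q, w_k, w_v = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
fused_frames = fuse_visual_text(frames, caption, w_q, w_k, w_v)
print(fused_frames.shape)
```

A full Transformer layer would add multiple heads, layer normalization, and a feed-forward block. The sketch keeps only the attention and residual parts to show how a caption token can inject textual context into every frame token.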
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Hand Gesture Recognition Systems
Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Residual Connection · Absolute Position Encodings · Layer Normalization · Dense Connections · Linear Layer · Adam · Position-Wise Feed-Forward Layer
