Few-shot Action Recognition with Captioning Foundation Models

Xiang Wang; Shiwei Zhang; Hangjie Yuan; Yingya Zhang; Changxin Gao; Deli Zhao; Nong Sang

arXiv:2310.10125 · cs.CV · October 17, 2023 · 5 cites

Open Access

TL;DR

CapFSAR leverages captioning foundation models to automatically generate textual descriptions from videos, enabling effective multimodal few-shot action recognition without manual annotation, and achieves state-of-the-art results.

Contribution

Introduces CapFSAR, a plug-and-play framework that exploits multimodal foundation models for few-shot action recognition without manual text annotation.

Findings

01. Outperforms existing methods on multiple benchmarks.

02. Effectively utilizes automatically generated captions for recognition.

03. Achieves state-of-the-art performance in few-shot settings.

Abstract

Transferring vision-language knowledge from pretrained multimodal foundation models to various downstream tasks is a promising direction. However, most current few-shot action recognition methods are still limited to a single visual modality input due to the high cost of annotating additional textual descriptions. In this paper, we develop an effective plug-and-play framework called CapFSAR to exploit the knowledge of multimodal models without manually annotating text. To be specific, we first utilize a captioning foundation model (i.e., BLIP) to extract visual features and automatically generate associated captions for input videos. Then, we apply a text encoder to the synthetic captions to obtain representative text embeddings. Finally, a visual-text aggregation module based on Transformer is further designed to incorporate cross-modal spatio-temporal complementary information for…
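The three steps named in the abstract (visual features, caption embeddings, Transformer-based aggregation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: random arrays stand in for the BLIP visual features and for the text-encoder embeddings of the generated captions, the aggregation is a single self-attention layer with random weights rather than the paper's module, and every name and size here (`aggregate`, `visual_tokens`, `text_tokens`, 8 frames, 5 caption tokens, dimension 16) is hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-5):
    # Normalize each token to zero mean and unit variance (no learned scale/shift).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def aggregate(visual_tokens, text_tokens, rng):
    """Single-head self-attention over the concatenated visual and text tokens.

    Hypothetical stand-in for a visual-text aggregation module: each frame
    token can attend to every frame token and every caption token.
    """
    tokens = np.concatenate([visual_tokens, text_tokens], axis=0)  # (Nv + Nt, d)
    d = tokens.shape[-1]
    # Random projection weights; a trained model would learn these.
    w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(d))       # (Nv + Nt, Nv + Nt)
    fused = layer_norm(tokens + attn @ v)      # residual connection + layer norm
    # Keep only the positions that correspond to video frames.
    return fused[: visual_tokens.shape[0]], attn


rng = np.random.default_rng(0)
visual_tokens = rng.standard_normal((8, 16))  # assumption: 8 frame features, dim 16
text_tokens = rng.standard_normal((5, 16))    # assumption: 5 caption-token embeddings
fused, attn = aggregate(visual_tokens, text_tokens, rng)
print(fused.shape, attn.shape)
```

In this sketch the fused frame tokens keep the shape of the visual input, so they could replace visual-only features in an existing few-shot pipeline, which is one way to read the "plug-and-play" description.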

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Hand Gesture Recognition Systems

Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Residual Connection · Absolute Position Encodings · Layer Normalization · Dense Connections · Linear Layer · Adam · Position-Wise Feed-Forward Layer