Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models

Wenhao Wu; Xiaohan Wang; Haipeng Luo; Jingdong Wang; Yi Yang; Wanli Ouyang

arXiv:2301.00182 · cs.CV · March 28, 2023 · 6 cites

Open Access 5 Repos

TL;DR

This paper introduces BIKE, a novel framework that leverages pre-trained vision-language models to enhance video recognition by exploring bidirectional cross-modal knowledge, achieving state-of-the-art results across multiple datasets.

Contribution

The paper proposes a new framework called BIKE that utilizes cross-modal bridges to explore bidirectional knowledge for improved video recognition using pre-trained VLMs.

Findings

01. Achieves 88.6% accuracy on Kinetics-400 using the released CLIP model.

02. Outperforms existing methods in zero-shot and few-shot recognition.

03. Demonstrates effectiveness across six popular video datasets.

Abstract

Vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability on various visual tasks. Transferring knowledge from such powerful VLMs is a promising direction for building effective video recognition models. However, current exploration in this field is still limited. We believe that the greatest value of pre-trained VLMs lies in building a bridge between visual and textual domains. In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: i) We introduce the Video Attribute Association mechanism, which leverages the Video-to-Text knowledge to generate textual auxiliary attributes for complementing video recognition. ii) We also present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free…
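The two directions named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes frame, video, and text embeddings already live in a shared space (as a CLIP-style model would provide), and the function names `top_k_attributes` and `saliency_pool`, the `temperature` value, and the toy vectors are all hypothetical. Video-to-Text is shown as ranking candidate attribute phrases by cosine similarity to a video embedding; Text-to-Video is shown as weighting frames by their similarity to a text embedding, with no learned parameters.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_attributes(video_emb, attribute_embs, k=2):
    """Video-to-Text (illustrative): pick the k attribute phrases closest to the video."""
    v = l2_normalize(video_emb)
    scored = [(dot(v, l2_normalize(e)), name) for name, e in attribute_embs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


def saliency_pool(frame_embs, text_emb, temperature=0.1):
    """Text-to-Video (illustrative): weight frames by similarity to the text.

    No learned parameters are involved; the weights come only from
    frame-text cosine similarities passed through a softmax.
    """
    t = l2_normalize(text_emb)
    frames = [l2_normalize(f) for f in frame_embs]
    weights = softmax([dot(f, t) for f in frames], temperature)
    pooled = [sum(w * f[i] for w, f in zip(weights, frames)) for i in range(len(t))]
    return weights, pooled


# Toy 2-D embeddings standing in for real model outputs (assumed values).
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text = [1.0, 0.0]
attributes = {"riding": [0.9, 0.1], "cooking": [0.0, 1.0], "outdoor": [0.6, 0.4]}

weights, pooled = saliency_pool(frames, text)
selected = top_k_attributes(pooled, attributes, k=2)
print(weights)
print(selected)
```

In this toy setup the frame most aligned with the text receives the largest weight, so the pooled video embedding leans toward that frame before attributes are ranked against it.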

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training