VideoAdviser: Video Knowledge Distillation for Multimodal Transfer Learning

Yanan Wang; Donghuo Zeng; Shinya Wada; Satoshi Kurihara

arXiv:2309.15494 · cs.CV · September 28, 2023

TL;DR

VideoAdviser is a knowledge distillation approach that transfers multimodal video knowledge from a multimodal teacher model to a single-modality student model, improving both efficiency and performance on multimodal tasks such as sentiment analysis and audio-visual retrieval.

Contribution

The paper proposes a novel video knowledge distillation method that enables efficient multimodal transfer learning by distilling knowledge from a CLIP-based teacher to a RoBERTa-based student model.
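
The teacher-to-student transfer described above can be illustrated with a generic distillation objective. The sketch below is a minimal, dependency-free example of the classic temperature-softened KL distillation loss; it is not the paper's actual loss, whose exact form is not given in this summary. The function names, the temperature value, and the toy logits are all hypothetical and chosen only for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Generic soft-label distillation loss (assumed form, not the paper's).

    The student is penalized for diverging from the teacher's softened
    output distribution. The T^2 factor is the usual gradient rescaling.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)


# Hypothetical logits: stand-ins for a multimodal teacher's output and
# two text-only student outputs on the same example.
teacher_logits = [2.0, 0.5, -1.0]
student_close = [1.8, 0.6, -0.9]   # roughly agrees with the teacher
student_far = [-1.0, 0.5, 2.0]     # ranks the classes in reverse

loss_close = distillation_loss(teacher_logits, student_close)
loss_far = distillation_loss(teacher_logits, student_far)
print(f"close student: {loss_close:.4f}, far student: {loss_far:.4f}")
```

In a setup like this, only the student is needed at inference time, so the teacher adds training cost but no inference cost, which is consistent with the efficiency claim in the findings.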

Findings

01. Up to 12.3% MAE improvement in sentiment analysis

02. 3.4% mAP increase in audio-visual retrieval

03. Enhanced state-of-the-art performance without extra inference costs

Abstract

Multimodal transfer learning aims to transform pretrained representations of diverse modalities into a common domain space for effective multimodal fusion. However, conventional systems are typically built on the assumption that all modalities exist, and the lack of modalities always leads to poor inference performance. Furthermore, extracting pretrained embeddings for all modalities is computationally inefficient for inference. In this work, to achieve high efficiency-performance multimodal transfer learning, we propose VideoAdviser, a video knowledge distillation method to transfer multimodal knowledge of video-enhanced prompts from a multimodal fundamental model (teacher) to a specific modal fundamental model (student). With an intuition that the best learning performance comes with professional advisers and smart students, we use a CLIP-based teacher model to provide expressive…

Taxonomy

Methods: Masked autoencoder · Knowledge Distillation