VideoAdviser: Video Knowledge Distillation for Multimodal Transfer Learning
Yanan Wang, Donghuo Zeng, Shinya Wada, Satoshi Kurihara

TL;DR
VideoAdviser introduces a knowledge distillation approach that transfers multimodal video knowledge from a powerful teacher model to a simpler student model, improving efficiency and performance in multimodal tasks like sentiment analysis and retrieval.
Contribution
The paper proposes a novel video knowledge distillation method that enables efficient multimodal transfer learning by distilling knowledge from a CLIP-based teacher to a RoBERTa-based student model.
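To make the teacher-to-student idea concrete, here is a minimal sketch of a generic response-based distillation loss for a regression task such as sentiment scoring. It is not the paper's exact objective. The function names, the choice of mean squared error, and the `alpha` weighting are illustrative assumptions, and the teacher outputs are assumed to be precomputed and frozen.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    student_pred: Sequence[float],
    teacher_pred: Sequence[float],
    target: Sequence[float],
    alpha: float = 0.5,  # hypothetical weight, not taken from the paper
) -> float:
    """Blend a task loss with a teacher-matching loss.

    task    = MSE(student, ground-truth label)
    distill = MSE(student, frozen teacher output)
    """
    task = mse(student_pred, target)
    distill = mse(student_pred, teacher_pred)
    return (1.0 - alpha) * task + alpha * distill
```

With `alpha=0` the student trains on labels only, and with `alpha=1` it only imitates the teacher. Because the teacher is used during training alone, a student trained this way adds no teacher cost at inference time.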
Findings
Up to 12.3% MAE improvement in sentiment analysis
3.4% mAP increase in audio-visual retrieval
Enhanced state-of-the-art performance without extra inference costs
Abstract
Multimodal transfer learning aims to transform pretrained representations of diverse modalities into a common domain space for effective multimodal fusion. However, conventional systems are typically built on the assumption that all modalities exist, and the lack of modalities always leads to poor inference performance. Furthermore, extracting pretrained embeddings for all modalities is computationally inefficient for inference. In this work, to achieve high efficiency-performance multimodal transfer learning, we propose VideoAdviser, a video knowledge distillation method to transfer multimodal knowledge of video-enhanced prompts from a multimodal fundamental model (teacher) to a specific modal fundamental model (student). With an intuition that the best learning performance comes with professional advisers and smart students, we use a CLIP-based teacher model to provide expressive…
Taxonomy
Methods: Masked autoencoder · Knowledge Distillation
