Compressing Visual-linguistic Model via Knowledge Distillation

Zhiyuan Fang; Jianfeng Wang; Xiaowei Hu; Lijuan Wang; Yezhou Yang; Zicheng Liu

arXiv:2104.02096·cs.CV·April 7, 2021

Open Access

TL;DR

This paper introduces a knowledge distillation method to effectively compress large visual-linguistic models into smaller ones by aligning their representations despite differences in visual region proposals.

Contribution

It proposes a novel distillation approach that aligns hidden states and attention distributions using shared region proposals, improving small VL model performance.

Findings

01. Achieves a CIDEr score of 120.8 on COCO captioning, 5.1 higher than the baseline

02. Attains an accuracy of 69.8 on VQA 2.0, 0.8 higher than the baseline

03. Demonstrates effectiveness in both the pre-training and fine-tuning stages

Abstract

Despite exciting progress in pre-training for visual-linguistic (VL) representations, very few aspire to a small VL model. In this paper, we study knowledge distillation (KD) to effectively compress a transformer-based large VL model into a small VL model. The major challenge arises from the inconsistent regional visual tokens extracted from different detectors of Teacher and Student, resulting in the misalignment of hidden representations and attention distributions. To address the problem, we retrain and adapt the Teacher by using the same region proposals from Student's detector while the features are from Teacher's own object detector. With aligned network inputs, the adapted Teacher is capable of transferring the knowledge through the intermediate representations. Specifically, we use the mean square error loss to mimic the attention distribution inside the transformer block and…
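
The attention-mimicking term mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is plain Python under the assumption that Teacher and Student attention maps have the same shape, which is what sharing the Student's region proposals is meant to guarantee. The function names `softmax` and `attention_mse` and the toy scores are hypothetical, chosen only for illustration, and only the attention term is shown because the rest of the loss is truncated in the abstract above.

```python
import math


def softmax(scores):
    """Turn one row of raw attention scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_mse(teacher_attn, student_attn):
    """Mean squared error between two attention maps of identical shape.

    Each map is a list of rows; each row is a distribution over tokens.
    Identical shapes are an assumption (aligned visual and text tokens).
    """
    if len(teacher_attn) != len(student_attn):
        raise ValueError("attention maps must have the same number of rows")
    sq_sum = 0.0
    count = 0
    for t_row, s_row in zip(teacher_attn, student_attn):
        if len(t_row) != len(s_row):
            raise ValueError("attention rows must have the same length")
        for t, s in zip(t_row, s_row):
            sq_sum += (t - s) ** 2
            count += 1
    if count == 0:
        raise ValueError("attention maps must not be empty")
    return sq_sum / count


# Toy scores for two query tokens attending over three tokens (made-up values).
teacher_scores = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
student_scores = [[1.5, 1.2, 0.3], [0.2, 0.9, 2.5]]

teacher_attn = [softmax(row) for row in teacher_scores]
student_attn = [softmax(row) for row in student_scores]

loss = attention_mse(teacher_attn, student_attn)
print(f"attention distillation loss: {loss:.6f}")
```

The loss is zero only when the Student reproduces the Teacher's attention distribution exactly, so minimizing it pulls the Student's attention toward the Teacher's. The shape check makes the alignment requirement explicit: if the two models saw different region proposals, the rows would not correspond to the same tokens and the comparison would be meaningless.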

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Knowledge Distillation