EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning

Tiannan Wang; Wangchunshu Zhou; Yan Zeng; Xinsong Zhang

arXiv:2210.07795·cs.CL·October 17, 2022

TL;DR

This paper introduces EfficientVLM, a compact and fast vision-language model created through knowledge distillation and modal-adaptive pruning, achieving high accuracy with significantly fewer parameters and faster inference.

Contribution

The paper presents a novel framework combining knowledge distillation and adaptive pruning to efficiently compress and accelerate vision-language models while maintaining high performance.
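
As an illustration of the knowledge distillation half of the framework, here is a minimal sketch of a generic soft-label distillation loss: the student is trained to match the teacher's temperature-softened output distribution. This is the textbook formulation, not necessarily the objective used in EfficientVLM; the function names `softmax` and `distillation_loss` and the default temperature are assumptions made for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (numerically stable)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Hypothetical helper for illustration only; the paper's actual
    distillation objectives may differ.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )
    return kl * temperature ** 2

# The loss is zero when the student reproduces the teacher exactly
# and positive otherwise.
teacher = [2.0, 0.5, -1.0]
same = distillation_loss(teacher, teacher)
different = distillation_loss([0.0, 0.0, 0.0], teacher)
```

The T^2 factor is the usual convention for keeping gradient magnitudes comparable across temperatures.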

Findings

01. EfficientVLM retains 98.4% of the teacher model's performance.

02. Inference is 2.2x faster than the teacher model.

03. EfficientVLM achieves state-of-the-art results among models of similar size.

Abstract

Pre-trained vision-language models (VLMs) have achieved impressive results in a range of vision-language tasks. However, popular VLMs usually consist of hundreds of millions of parameters, which poses challenges for fine-tuning and deployment in real-world applications due to space, memory, and latency constraints. In this work, we introduce a distilling-then-pruning framework to compress large vision-language models into smaller, faster, and more accurate ones. We first shrink the size of a pre-trained large VLM and apply knowledge distillation in the vision-language pre-training stage to obtain a task-agnostic compact VLM. Then we propose a modal-adaptive pruning algorithm to automatically infer the importance of vision and language modalities for different downstream tasks and adaptively remove redundant structures and neurons in different encoders with controllable target sparsity.…
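
To make the idea of pruning different encoders adaptively under one target sparsity concrete, the sketch below ranks units from all encoders together by an importance score and removes the lowest-scoring ones until the overall target is met. The per-encoder sparsity then falls out of the scores rather than being fixed in advance. This is a simple global-ranking stand-in, not the paper's pruning algorithm; the function names, the encoder names, and the scores are all made up for illustration.

```python
def prune_to_target(scores, target_sparsity):
    """Build keep/drop masks for several encoders under one global sparsity.

    scores: dict mapping encoder name -> list of per-unit importance scores
            (hypothetical values; how they are obtained is out of scope here).
    target_sparsity: fraction of all units to remove, between 0 and 1.
    Returns a dict mapping encoder name -> list of 0/1 masks (1 = keep).
    """
    all_units = [
        (score, name, idx)
        for name, values in scores.items()
        for idx, score in enumerate(values)
    ]
    n_prune = int(round(target_sparsity * len(all_units)))
    # Rank every unit across encoders and drop the least important ones.
    all_units.sort(key=lambda unit: unit[0])
    pruned = {(name, idx) for _, name, idx in all_units[:n_prune]}
    return {
        name: [0 if (name, idx) in pruned else 1 for idx in range(len(values))]
        for name, values in scores.items()
    }

def sparsity(mask):
    """Fraction of units removed in a single mask."""
    return 1 - sum(mask) / len(mask)

# Made-up scores for a task that leans more on vision than on language.
scores = {
    "vision": [0.9, 0.8, 0.1, 0.7],
    "language": [0.2, 0.05, 0.3, 0.6],
}
masks = prune_to_target(scores, target_sparsity=0.5)
```

With these made-up scores, half of all units are removed overall, but the language encoder ends up sparser than the vision encoder because its units score lower.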

Code & Models

Repositories

swaggy-tn/efficientvlm (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning

Methods: Pruning · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Knowledge Distillation