DLIP: Distilling Language-Image Pre-training

Huafeng Kuang; Jie Wu; Xiawu Zheng; Ming Li; Xuefeng Xiao; Rui Wang; Min Zheng; Rongrong Ji

arXiv:2308.12956·cs.CV·August 25, 2023·2 cites

Open Access

TL;DR

DLIP is a framework for distilling large vision-language pre-trained (VLP) models into smaller, more efficient ones that maintain high performance on tasks such as retrieval, captioning, and VQA while using substantially fewer parameters.

Contribution

The paper provides a comprehensive analysis and practical guidelines for VLP model distillation, achieving state-of-the-art efficiency and accuracy trade-offs.

Findings

01. DLIP compresses BLIP by 1.9x with comparable performance.

02. DLIP retains over 95% of performance with only 22.4% of the parameters.

03. DLIP speeds up inference by 2.7x.
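
The first two findings express model size in different units: a compression ratio (1.9x) and a parameter fraction (22.4%). The sketch below converts between the two, assuming both are measured against the same teacher's parameter count. The function names are illustrative and do not come from the paper.

```python
def fraction_from_ratio(compression_ratio: float) -> float:
    """Fraction of the teacher's parameters kept at a given compression ratio."""
    return 1.0 / compression_ratio


def ratio_from_fraction(params_fraction: float) -> float:
    """Compression ratio implied by keeping a given fraction of the parameters."""
    return 1.0 / params_fraction


# Figures quoted in the findings above.
kept_at_1_9x = fraction_from_ratio(1.9)       # roughly 0.526
ratio_at_22_4 = ratio_from_fraction(0.224)    # roughly 4.46

print(f"1.9x compression keeps {kept_at_1_9x:.1%} of the parameters")
print(f"22.4% of the parameters is a {ratio_at_22_4:.2f}x compression")
```

Under that assumption, a 1.9x compression keeps about 52.6% of the parameters, while 22.4% of the parameters corresponds to roughly 4.5x compression, so the two findings appear to describe student models of different sizes.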

Abstract

Vision-Language Pre-training (VLP) has made remarkable progress, but it relies on extremely large parameter counts, which makes deployment in real applications challenging. Knowledge distillation is widely recognized as an essential procedure in model compression. However, existing knowledge distillation techniques lack an in-depth investigation and analysis of VLP, and practical guidelines for VLP-oriented distillation have not yet been explored. In this paper, we present DLIP, a simple yet efficient Distilling Language-Image Pre-training framework, through which we investigate how to distill a light VLP model. Specifically, we dissect the model distillation from multiple dimensions, such as the architecture characteristics of different modules and the information transfer of different modalities. We conduct comprehensive experiments and provide insights on distilling a light but performant VLP…
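
For readers unfamiliar with knowledge distillation, the sketch below shows the generic temperature-scaled logit-matching loss: a KL divergence between softened teacher and student distributions. It is a textbook formulation given for orientation only, not DLIP's objective, which the truncated abstract above does not specify. The function names and the default temperature are assumptions.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_total for x in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Generic logit distillation; NOT the loss used in the DLIP paper.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2


# Identical logits give zero loss; mismatched logits give a positive loss.
same = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
print(same, different)
```

In practice this term is usually combined with the student's ordinary task loss. How DLIP weights or replaces such terms across modules and modalities is described in the paper itself.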

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: BLIP: Bootstrapping Language-Image Pre-training · Knowledge Distillation · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings