MiniVLM: A Smaller and Faster Vision-Language Model

Jianfeng Wang; Xiaowei Hu; Pengchuan Zhang; Xiujun Li; Lijuan Wang; Lei Zhang; Jianfeng Gao; Zicheng Liu

arXiv:2012.06946 · cs.CV · August 11, 2021 · 29 cites

Open Access

TL;DR

MiniVLM is a compact vision-language model that cuts model size by 73% and inference time by 94% relative to a larger baseline while retaining 94-97% of its accuracy on VL tasks, making it suitable for edge applications.

Contribution

The paper introduces MiniVLM, a lightweight VL model that pairs a Two-stage Efficient feature Extractor (TEE) with a transformer-based vision-language fusion module, achieving substantial speedups and size reduction with minimal accuracy loss.

Findings

01. Reduces model size by 73%
02. Decreases inference time by 94%
03. Retains 94-97% of accuracy on VL tasks
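
To make the percentages above concrete, the following minimal sketch converts a percentage reduction into the fraction of the baseline that remains and into an equivalent speedup factor. The function names are hypothetical and chosen only for illustration; the sketch assumes each percentage is measured against the same baseline model.

```python
def remaining_fraction(reduction_pct: float) -> float:
    """Fraction of the baseline left after reducing it by reduction_pct percent."""
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in [0, 100)")
    return 1.0 - reduction_pct / 100.0


def speedup_factor(reduction_pct: float) -> float:
    """How many times faster a model is after its time cost drops by reduction_pct percent."""
    return 1.0 / remaining_fraction(reduction_pct)


# Figures taken from the findings list: 73% smaller, 94% less inference time.
size_left = remaining_fraction(73)
speedup = speedup_factor(94)

print(f"Model size relative to baseline: {size_left:.2f}")
print(f"Inference speedup over baseline: {speedup:.2f}x")
```

Under these assumptions, a 73% size reduction leaves a model roughly a quarter the size of the baseline, and a 94% reduction in inference time corresponds to a speedup of roughly 17x.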

Abstract

Recent vision-language (VL) studies have shown remarkable progress by learning generic representations from massive image-text pairs with transformer models and then fine-tuning on downstream VL tasks. While existing research has been focused on achieving high accuracy with large pre-trained models, building a lightweight model is of great value in practice but is less explored. In this paper, we propose a smaller and faster VL model, MiniVLM, which can be finetuned with good performance on various downstream tasks like its larger counterpart. MiniVLM consists of two modules, a vision feature extractor and a transformer-based vision-language fusion module. We design a Two-stage Efficient feature Extractor (TEE), inspired by the one-stage EfficientDet network, to significantly reduce the time cost of visual feature extraction by 95%, compared to a baseline model. We adopt the MiniLM…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Linear Layer · Pointwise Convolution · Depthwise Convolution · Depthwise Separable Convolution · Batch Normalization · Linear Warmup With Linear Decay · BiFPN · Attention Is All You Need · Layer Normalization
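
Depthwise separable convolution, one of the methods tagged above, factors a standard convolution into a per-channel depthwise step followed by a pointwise (1×1) step. The sketch below compares weight counts for the two forms. The kernel size and channel widths are illustrative assumptions, not values from the paper, and bias terms are ignored.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard kernel x kernel convolution (no bias)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise (kernel x kernel per input channel)
    plus pointwise (1 x 1) convolution pair (no bias)."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Illustrative layer: 3x3 kernel, 64 input and 64 output channels (assumed values).
std = standard_conv_params(3, 64, 64)
sep = depthwise_separable_params(3, 64, 64)

print(f"standard: {std} weights")
print(f"depthwise separable: {sep} weights")
print(f"reduction factor: {std / sep:.2f}x")
```

The saving grows with the number of output channels and the kernel size, which is why this factorization is common in lightweight detection backbones.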