MiniVLM: A Smaller and Faster Vision-Language Model
Jianfeng Wang, Xiaowei Hu, Pengchuan Zhang, Xiujun Li, Lijuan Wang, Lei Zhang, Jianfeng Gao, Zicheng Liu

TL;DR
MiniVLM is a compact and efficient vision-language model that significantly reduces computational costs while maintaining high accuracy, making it suitable for edge applications.
Contribution
The paper introduces MiniVLM, a lightweight VL model with a novel two-stage feature extractor and optimized transformer, achieving substantial speedups and size reduction with minimal accuracy loss.
Findings
Reduces model size by 73%
Decreases inference time by 94%
Retains 94-97% of accuracy on VL tasks
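The percentage reductions above can be easier to compare when expressed as "times smaller" or "times faster" factors. The following is a minimal sketch of that arithmetic only; the helper name `reduction_to_factor` is hypothetical and is not taken from the paper, and the inputs are simply the figures listed under Findings.

```python
def reduction_to_factor(percent_reduction: float) -> float:
    """Convert a percentage reduction into a multiplicative factor.

    Hypothetical helper for illustration: a 73% reduction leaves 27% of the
    original, so the result is 100 / 27 times smaller.
    """
    if not 0 <= percent_reduction < 100:
        raise ValueError("percent_reduction must be in [0, 100)")
    return 100.0 / (100.0 - percent_reduction)


# Figures taken from the Findings list above.
size_factor = reduction_to_factor(73)   # model size reduced by 73%
speed_factor = reduction_to_factor(94)  # inference time reduced by 94%

print(f"about {size_factor:.1f}x smaller, about {speed_factor:.1f}x faster")
```

Under these figures, the model is roughly 3.7 times smaller and inference is roughly 16.7 times faster than the larger counterpart it is compared against.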
Abstract
Recent vision-language (VL) studies have shown remarkable progress by learning generic representations from massive image-text pairs with transformer models and then fine-tuning on downstream VL tasks. While existing research has focused on achieving high accuracy with large pre-trained models, building a lightweight model is of great value in practice but is less explored. In this paper, we propose a smaller and faster VL model, MiniVLM, which can be fine-tuned with good performance on various downstream tasks like its larger counterpart. MiniVLM consists of two modules, a vision feature extractor and a transformer-based vision-language fusion module. We design a Two-stage Efficient feature Extractor (TEE), inspired by the one-stage EfficientDet network, to significantly reduce the time cost of visual feature extraction compared to a baseline model. We adopt the MiniLM…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Linear Layer · Pointwise Convolution · Depthwise Convolution · Depthwise Separable Convolution · Batch Normalization · Linear Warmup With Linear Decay · BiFPN · Attention Is All You Need · Layer Normalization
