Distilled Dual-Encoder Model for Vision-Language Understanding
Zekun Wang, Wenhui Wang, Haichao Zhu, Ming Liu, Bing Qin, Furu Wei

TL;DR
This paper introduces a cross-modal attention distillation framework to enhance dual-encoder models for vision-language tasks, achieving a balance between fast inference and complex understanding.
Contribution
It presents a novel distillation method that transfers deep interaction knowledge from fusion-encoder models to dual-encoders, improving their performance.
Findings
Achieves competitive results on visual reasoning, entailment, and VQA tasks.
Maintains faster inference speed compared to fusion-encoder models.
Improves dual-encoder performance through combined pre-training and fine-tuning distillation.
Abstract
We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks, such as visual reasoning and visual question answering. Dual-encoder models have a faster inference speed than fusion-encoder models and enable the pre-computation of images and text during inference. However, the shallow interaction module used in dual-encoder models is insufficient to handle complex vision-language understanding tasks. In order to learn deep interactions of images and text, we introduce cross-modal attention distillation, which uses the image-to-text and text-to-image attention distributions of a fusion-encoder model to guide the training of our dual-encoder model. In addition, we show that applying the cross-modal attention distillation for both pre-training and fine-tuning stages achieves further improvements. Experimental results…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
MethodsSPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
