Distilled Dual-Encoder Model for Vision-Language Understanding

Zekun Wang; Wenhui Wang; Haichao Zhu; Ming Liu; Bing Qin; Furu Wei

arXiv:2112.08723 · cs.CL · October 18, 2022 · 1 cites

TL;DR

This paper introduces a cross-modal attention distillation framework for training dual-encoder models on vision-language understanding tasks, so that they keep their fast inference while handling tasks that require deep image-text interaction.

Contribution

It presents a distillation method that uses the image-to-text and text-to-image attention distributions of a fusion-encoder model to guide the training of a dual-encoder model, improving the dual-encoder's performance.

Findings

01 Achieves competitive results on visual reasoning, visual entailment, and VQA tasks.

02 Runs inference faster than fusion-encoder models.

03 Applying distillation in both the pre-training and fine-tuning stages improves dual-encoder performance further.

Abstract

We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks, such as visual reasoning and visual question answering. Dual-encoder models have a faster inference speed than fusion-encoder models and enable the pre-computation of images and text during inference. However, the shallow interaction module used in dual-encoder models is insufficient to handle complex vision-language understanding tasks. In order to learn deep interactions of images and text, we introduce cross-modal attention distillation, which uses the image-to-text and text-to-image attention distributions of a fusion-encoder model to guide the training of our dual-encoder model. In addition, we show that applying the cross-modal attention distillation for both pre-training and fine-tuning stages achieves further improvements. Experimental results…
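The distillation objective described in the abstract can be illustrated with a minimal sketch. The abstract only states that the teacher's image-to-text and text-to-image attention distributions guide the student; everything else here is an assumption made for illustration: the use of KL divergence as the matching loss, scaled dot-product attention computed directly on token vectors (no learned projections or multiple heads), and the hypothetical function names `cross_attention`, `kl_divergence`, and `cross_modal_distill_loss`.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys):
    """One attention distribution over `keys` per query (scaled dot product).

    Assumption: token vectors are used directly as queries and keys.
    """
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


def kl_divergence(p_rows, q_rows, eps=1e-12):
    """Mean over rows of KL(p || q); eps guards against log(0)."""
    total = 0.0
    for p, q in zip(p_rows, q_rows):
        total += sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    return total / len(p_rows)


def cross_modal_distill_loss(t_text, t_image, s_text, s_image):
    """Match student cross-modal attention to the teacher's (assumed KL loss).

    t_* are teacher token vectors, s_* are student token vectors for the
    same input, so the token counts agree; the hidden sizes may differ.
    """
    text_to_image = kl_divergence(
        cross_attention(t_text, t_image), cross_attention(s_text, s_image)
    )
    image_to_text = kl_divergence(
        cross_attention(t_image, t_text), cross_attention(s_image, s_text)
    )
    return text_to_image + image_to_text


# Toy inputs: 2 text tokens and 3 image tokens, hidden size 2.
teacher_text = [[1.0, 0.0], [0.0, 1.0]]
teacher_image = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student_text = [[0.5, 0.5], [0.2, 0.1]]
student_image = [[0.0, 1.0], [1.0, 0.0], [0.3, 0.3]]

loss = cross_modal_distill_loss(teacher_text, teacher_image, student_text, student_image)
print(f"distillation loss: {loss:.4f}")
```

In this sketch the loss is zero when the student reproduces the teacher's cross-modal attention exactly and positive otherwise, so minimizing it pushes the student's attention toward the teacher's. In practice such a term would be added to the task loss during training.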

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings