LXMERT Model Compression for Visual Question Answering

Maryam Hashemi; Ghazaleh Mahmoudi; Sara Kodeiri; Hadi Sheikhi; Sauleh Eetemadi

arXiv:2310.15325 · cs.CV · October 25, 2023 · 1 citation

TL;DR

This paper investigates whether trainable subnetworks exist within LXMERT when it is fine-tuned for visual question answering (VQA), and shows that the model can be pruned by 40-60% in size with a 3% loss in accuracy.

Contribution

The paper applies the lottery ticket hypothesis to LXMERT fine-tuning to test whether trainable subnetworks exist for VQA, and adds a cost-benefit analysis of how far the model can be pruned before accuracy drops significantly.

Findings

01. LXMERT can be pruned by 40-60%

02. Pruning results in only 3% accuracy loss

03. Subnetworks exist within LXMERT for VQA

Abstract

Large-scale pretrained models such as LXMERT are becoming popular for learning cross-modal representations on text-image pairs for vision-language tasks. According to the lottery ticket hypothesis, NLP and computer vision models contain smaller subnetworks capable of being trained in isolation to full performance. In this paper, we combine these observations to evaluate whether such trainable subnetworks exist in LXMERT when fine-tuned on the VQA task. In addition, we perform a model size cost-benefit analysis by investigating how much pruning can be done without significant loss in accuracy. Our experimental results demonstrate that LXMERT can be effectively pruned by 40%-60% in size with a 3% loss in accuracy.
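The pruning idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the pruning criterion, so this example assumes plain magnitude pruning (the criterion commonly paired with the lottery ticket hypothesis) and uses a flat list of random numbers as a stand-in for model weights. The function names `magnitude_mask` and `lottery_ticket_reset`, the toy weights, and the simulated "fine-tuning" step are all hypothetical and are not taken from the paper or its repositories.

```python
import random


def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that drops the `sparsity` fraction of smallest-|w| entries."""
    n_prune = int(round(sparsity * len(weights)))
    # Indices ordered by absolute weight value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


def lottery_ticket_reset(initial_weights, mask):
    """Rewind surviving weights to their initial values; pruned weights become zero."""
    return [w0 * m for w0, m in zip(initial_weights, mask)]


random.seed(0)
# Hypothetical starting weights (assumption: these play the role of the pretrained model).
initial = [random.gauss(0.0, 1.0) for _ in range(1000)]
# Stand-in for fine-tuning: a real run would train on VQA here instead of adding noise.
finetuned = [w + random.gauss(0.0, 0.1) for w in initial]

# Sweep the sparsity range reported in the abstract (40% to 60%).
for sparsity in (0.4, 0.5, 0.6):
    mask = magnitude_mask(finetuned, sparsity)
    ticket = lottery_ticket_reset(initial, mask)
    kept = sum(mask) / len(mask)
    print(f"sparsity={sparsity:.0%} weights kept={kept:.0%}")
```

In a real experiment, the masked network `ticket` would then be retrained on VQA and its accuracy compared with the unpruned model at each sparsity level, which is the cost-benefit comparison the abstract describes.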

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Learning Cross-Modality Encoder Representations from Transformers · Pruning