Compressing And Debiasing Vision-Language Pre-Trained Models for Visual Question Answering
Qingyi Si, Yuanxin Liu, Zheng Lin, Peng Fu, Weiping Wang

TL;DR
This paper explores the joint compression and debiasing of vision-language pre-trained models for visual question answering, demonstrating the existence of sparse, robust subnetworks that outperform debiased full models on out-of-distribution datasets.
Contribution
It introduces a systematic approach to simultaneously compress and debias VLPs by searching for sparse, robust subnetworks tailored for VQA tasks.
Findings
Existence of sparse, robust subnetworks in VLPs.
Sparse subnetworks outperform debiased full models on OOD datasets.
Proposed method achieves competitive results with fewer parameters.
Abstract
Despite the excellent performance of vision-language pre-trained models (VLPs) on conventional VQA tasks, they still suffer from two problems. First, VLPs tend to rely on language biases in datasets and fail to generalize to out-of-distribution (OOD) data. Second, they are inefficient in terms of memory footprint and computation. Although promising progress has been made on both problems, most existing works tackle them independently. To facilitate the application of VLPs to VQA tasks, it is imperative to jointly study VLP compression and OOD robustness, which, however, has not yet been explored. This paper investigates whether a VLP can be compressed and debiased simultaneously by searching for sparse and robust subnetworks. To this end, we systematically study the design of a training and compression pipeline to search for the subnetworks, as well as the assignment of sparsity to different…
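To make the idea of a sparse subnetwork with per-group sparsity concrete, here is a minimal sketch in plain Python. It uses magnitude pruning (zeroing the smallest-magnitude weights), which is a common generic technique and is assumed here for illustration only; this summary does not state which pruning criterion the paper uses. The group names `language` and `vision`, the toy weights, and all function names are hypothetical.

```python
import math


def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that zeroes the `sparsity` fraction of
    smallest-magnitude entries in a flat list of weights."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = int(math.floor(sparsity * len(weights)))
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def prune_groups(groups, sparsity_per_group):
    """Build one mask per parameter group, each with its own sparsity."""
    return {
        name: magnitude_mask(w, sparsity_per_group[name])
        for name, w in groups.items()
    }


def overall_sparsity(masks):
    """Fraction of all weights that the masks set to zero."""
    total = sum(len(m) for m in masks.values())
    zeros = sum(m.count(0) for m in masks.values())
    return zeros / total


# Hypothetical toy parameter groups (not taken from the paper).
groups = {
    "language": [0.9, -0.1, 0.4, 0.05],
    "vision": [0.3, -0.7, 0.02, 0.6],
}
masks = prune_groups(groups, {"language": 0.5, "vision": 0.25})
print(masks)
print(overall_sparsity(masks))
```

Giving each group its own sparsity level, rather than one global level, is what allows the overall parameter budget to be distributed unevenly across parts of the model. In a real setting the masks would be applied to tensors of a trained model, and the subnetwork would then be evaluated on both in-distribution and OOD data.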
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Learning Cross-Modality Encoder Representations from Transformers (LXMERT)
