Bilaterally Slimmable Transformer for Elastic and Efficient Visual Question Answering

Zhou Yu; Zitian Jin; Jun Yu; Mingliang Xu; Hongbo Wang; Jianping Fan

arXiv:2203.12814 · cs.CV · May 15, 2023 · 1 citation

TL;DR

This paper introduces the bilaterally slimmable Transformer (BST), a flexible framework enabling the creation of efficient, scalable VQA models that can adapt to different hardware constraints without retraining.

Contribution

The paper proposes BST, a novel method for training a single Transformer model that can be dynamically slimmed into submodels of various sizes for efficient VQA deployment.
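The idea of slimming one shared model along both width and depth can be illustrated with a minimal sketch. This is not the authors' implementation: the class names `SlimmableLinear` and `SlimmableStack`, the ratio parameters, and the choice to keep the first rows, columns, and layers are all assumptions made for illustration, and plain linear layers with ReLU stand in for Transformer blocks.

```python
import random


class SlimmableLinear:
    """Hypothetical linear layer whose active width is chosen at call time.

    The full weight matrix is stored once; a narrower submodel reuses the
    top-left slice of it (an assumed slicing rule, not taken from the paper).
    """

    def __init__(self, in_features, out_features, seed=0):
        rng = random.Random(seed)
        self.weight = [
            [rng.uniform(-0.1, 0.1) for _ in range(in_features)]
            for _ in range(out_features)
        ]
        self.bias = [0.0] * out_features

    def forward(self, x, width_ratio=1.0):
        in_active = len(x)  # input was already slimmed by the caller
        out_active = max(1, int(len(self.weight) * width_ratio))
        return [
            sum(w * v for w, v in zip(self.weight[o][:in_active], x)) + self.bias[o]
            for o in range(out_active)
        ]


class SlimmableStack:
    """Hypothetical stack of layers that can be slimmed in width and depth."""

    def __init__(self, dim, num_layers):
        self.dim = dim
        self.layers = [SlimmableLinear(dim, dim, seed=i) for i in range(num_layers)]

    def forward(self, x, width_ratio=1.0, depth_ratio=1.0):
        active_dim = max(1, int(self.dim * width_ratio))
        active_layers = max(1, int(len(self.layers) * depth_ratio))
        h = x[:active_dim]
        # Assumption: depth slimming keeps the first layers and skips the rest.
        for layer in self.layers[:active_layers]:
            h = [max(0.0, v) for v in layer.forward(h, width_ratio)]  # ReLU
        return h


model = SlimmableStack(dim=8, num_layers=4)
x = [1.0] * 8
full = model.forward(x)                                      # full model
small = model.forward(x, width_ratio=0.5, depth_ratio=0.5)   # slimmed submodel
print(len(full), len(small))
```

Both calls use the same stored weights, so switching between submodels in this sketch needs no retraining or reloading, only different slicing arguments.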

Findings

01. A slimmed MCAN-BST submodel achieves accuracy comparable to the original model with a smaller model size and fewer FLOPs.

02. The smallest MCAN-BST submodel runs on mobile devices with less than 60 ms latency.

03. BST can be integrated with multiple Transformer-based VQA models, demonstrating its generality.

Abstract

Recent advances in Transformer architectures [1] have brought remarkable improvements to visual question answering (VQA). Nevertheless, Transformer-based VQA models are usually deep and wide to guarantee good performance, so they can only run on powerful GPU servers and cannot run on capacity-restricted platforms such as mobile phones. Therefore, it is desirable to learn an elastic VQA model that supports adaptive pruning at runtime to meet the efficiency constraints of different platforms. To this end, we present the bilaterally slimmable Transformer (BST), a general framework that can be seamlessly integrated into arbitrary Transformer-based VQA models to train a single model once and obtain various slimmed submodels of different widths and depths. To verify the effectiveness and generality of this method, we integrate the proposed BST framework with three typical Transformer-based…
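Choosing a submodel to meet a platform's efficiency constraint can be sketched as a small search over (width, depth) pairs. Everything in this example is an assumption for illustration: the candidate ratio sets, the function names, and the cost proxy (cost proportional to width squared times depth, which holds for stacks of dense layers whose input and output widths shrink together) are not taken from the paper.

```python
from itertools import product


def relative_cost(width_ratio, depth_ratio):
    """Rough cost proxy relative to the full model (assumed, not measured)."""
    return width_ratio ** 2 * depth_ratio


def pick_submodel(budget,
                  width_ratios=(0.25, 0.5, 0.75, 1.0),
                  depth_ratios=(0.25, 0.5, 0.75, 1.0)):
    """Return the costliest (width, depth) pair within the budget, or None.

    `budget` is a fraction of the full model's cost. Ties are broken in
    favour of the wider submodel (an arbitrary choice for this sketch).
    """
    candidates = [
        (w, d)
        for w, d in product(width_ratios, depth_ratios)
        if relative_cost(w, d) <= budget
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda wd: (relative_cost(*wd), wd[0]))


print(pick_submodel(1.0))   # full budget
print(pick_submodel(0.3))   # constrained device
print(pick_submodel(0.01))  # budget below the smallest submodel
```

In practice the proxy would be replaced by measured latency or FLOPs for each submodel on the target device, and accuracy would be part of the selection criterion.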

Code & Models

Repositories

milvlg/bst
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Softmax · Label Smoothing