Bilaterally Slimmable Transformer for Elastic and Efficient Visual Question Answering
Zhou Yu, Zitian Jin, Jun Yu, Mingliang Xu, Hongbo Wang, Jianping Fan

TL;DR
This paper introduces the bilaterally slimmable Transformer (BST), a flexible framework enabling the creation of efficient, scalable VQA models that can adapt to different hardware constraints without retraining.
Contribution
The paper proposes BST, a novel method for training a single Transformer model that can be dynamically slimmed into submodels of various sizes for efficient VQA deployment.
Findings
A slimmed MCAN-BST submodel achieves accuracy comparable to the original MCAN model while having a smaller model size and fewer FLOPs.
The smallest MCAN-BST submodel runs on a mobile device with less than 60 ms latency.
BST can be integrated with multiple Transformer-based VQA models, demonstrating its generality.
Abstract
Recent advances in Transformer architectures [1] have brought remarkable improvements to visual question answering (VQA). Nevertheless, Transformer-based VQA models are usually deep and wide to guarantee good performance, so they can only run on powerful GPU servers and cannot run on capacity-restricted platforms such as mobile phones. Therefore, it is desirable to learn an elastic VQA model that supports adaptive pruning at runtime to meet the efficiency constraints of different platforms. To this end, we present the bilaterally slimmable Transformer (BST), a general framework that can be seamlessly integrated into arbitrary Transformer-based VQA models to train a single model once and obtain various slimmed submodels of different widths and depths. To verify the effectiveness and generality of this method, we integrate the proposed BST framework with three typical Transformer-based…
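The idea of obtaining narrower and shallower submodels from one set of shared weights can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' BST implementation: the names `SlimmableLinear`, `slim_forward`, `active_params`, `width_ratio`, and `depth` are hypothetical, the layers are plain dense layers rather than Transformer blocks, and the choice of keeping the first units and the first layers is only one possible selection rule.

```python
# Illustrative sketch only; not the BST authors' code.
# All names below are hypothetical and chosen for this example.
import random


class SlimmableLinear:
    """Dense layer whose active weights are the top-left slice of a full matrix."""

    def __init__(self, in_features, out_features, seed=0):
        rng = random.Random(seed)
        self.in_features = in_features
        self.out_features = out_features
        # Full-width weights: out_features rows, each with in_features columns.
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_features)]
                       for _ in range(out_features)]

    def forward(self, x, width_ratio=1.0):
        # Keep the first fraction of output units; input columns follow len(x).
        n_out = max(1, int(self.out_features * width_ratio))
        n_in = len(x)
        return [sum(w * v for w, v in zip(row[:n_in], x))
                for row in self.weight[:n_out]]


def slim_forward(layers, x, width_ratio=1.0, depth=None):
    """Run a submodel that uses the first `depth` layers at the given width."""
    depth = len(layers) if depth is None else depth
    h = x[:max(1, int(len(x) * width_ratio))]  # slim the input features too
    for layer in layers[:depth]:
        h = layer.forward(h, width_ratio)
    return h


def active_params(layers, width_ratio=1.0, depth=None):
    """Count the weights a submodel actually touches."""
    depth = len(layers) if depth is None else depth
    total = 0
    for layer in layers[:depth]:
        n_in = max(1, int(layer.in_features * width_ratio))
        n_out = max(1, int(layer.out_features * width_ratio))
        total += n_in * n_out
    return total


if __name__ == "__main__":
    stack = [SlimmableLinear(8, 8, seed=i) for i in range(4)]
    x = [1.0] * 8
    print(len(slim_forward(stack, x)), active_params(stack))
    print(len(slim_forward(stack, x, 0.5, 2)), active_params(stack, 0.5, 2))
```

In this toy setting, the half-width, two-layer submodel touches 32 of the 256 weights in the full four-layer stack, and both share the same underlying weight matrices. How such shared weights are trained so that every submodel performs well is the subject of the paper and is not covered by this sketch.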
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Softmax · Label Smoothing
