Chop Chop BERT: Visual Question Answering by Chopping VisualBERT's Heads
Chenyu Gao, Qi Zhu, Peng Wang, Qi Wu

TL;DR
This paper investigates the roles of individual heads and layers in VisualBERT for VQA, revealing their specialization for different question types, and proposes a dynamic chopping method to reduce model size with minimal accuracy loss.
Contribution
It introduces a systematic analysis of Transformer heads and layers in VisualBERT for VQA and proposes a dynamic chopping module to optimize model efficiency.
Findings
Different heads and layers are responsible for different question types.
Higher-level layers are activated by higher-level visual reasoning questions.
The dynamic chopping module reduces parameters by 50% with less than 1% accuracy loss.
Abstract
Vision-and-Language (VL) pre-training has shown great potential on many related downstream tasks, such as Visual Question Answering (VQA), one of the most popular problems in the VL field. All of these pre-trained models (such as VisualBERT, ViLBERT, LXMERT and UNITER) are built on the Transformer, which extends the classical attention mechanism to multiple layers and heads. To investigate why and how these models work so well on VQA, in this paper we explore the roles of individual heads and layers in Transformer models when handling different types of questions. Specifically, we manually remove (chop) heads (or layers) from a pre-trained VisualBERT model one at a time, and test it on different levels of questions to record its performance. As shown in the interesting echelon shape of the result matrices, experiments reveal different heads and layers are responsible for different…
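To make the chopping procedure in the abstract concrete, here is a minimal NumPy sketch of removing one attention head at a time from a single multi-head self-attention layer. It is an illustration under stated assumptions, not the authors' code: the weights are random rather than taken from VisualBERT, the function name `multi_head_attention` and the `head_mask` argument are hypothetical, and the norm of the change in layer output stands in for the per-question-type accuracy that the paper records.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, head_mask):
    """Self-attention over x with per-head masking (hypothetical helper).

    x: (seq, d_model); Wq, Wk, Wv, Wo: (d_model, d_model);
    head_mask: (n_heads,) with 1.0 to keep a head and 0.0 to chop it.
    """
    seq, d_model = x.shape
    n_heads = head_mask.shape[0]
    d_head = d_model // n_heads

    def split(t):
        # (seq, d_model) -> (n_heads, seq, d_head)
        return t.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    ctx = softmax(scores) @ v                  # (n_heads, seq, d_head)
    ctx = ctx * head_mask[:, None, None]       # zero the output of chopped heads
    merged = ctx.transpose(1, 0, 2).reshape(seq, d_model)
    return merged @ Wo

# Toy setup with random weights (assumption: no pre-trained model involved).
rng = np.random.default_rng(0)
seq, d_model, n_heads = 5, 16, 4
x = rng.normal(size=(seq, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))

full = multi_head_attention(x, Wq, Wk, Wv, Wo, np.ones(n_heads))

# Chop one head at a time and record how far the layer output moves.
impact = []
for h in range(n_heads):
    mask = np.ones(n_heads)
    mask[h] = 0.0
    chopped = multi_head_attention(x, Wq, Wk, Wv, Wo, mask)
    impact.append(float(np.linalg.norm(full - chopped)))
```

In a real study the inner measurement would be replaced by VQA accuracy on each question type, giving one row of a result matrix per chopped head. Masking the head's output, rather than deleting its weights, keeps tensor shapes unchanged and makes each ablation easy to undo.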
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Linear Layer · Learning Cross-Modality Encoder Representations from Transformers · VisualBERT · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Layer Normalization · Adam · Softmax
