LV-BERT: Exploiting Layer Variety for BERT
Weihao Yu, Zihang Jiang, Fei Chen, Qibin Hou, Jiashi Feng

TL;DR
LV-BERT adds convolution to BERT's layer type set and explores layer orders beyond the usual interleaved self-attention/feed-forward pattern. A supernet and an evolutionary search are used to find strong architectures, which improve on BERT and its variants.
Contribution
The paper proposes improving BERT by varying both the types of layers and the order in which they are stacked, and it uses a supernet together with an evolutionary search to optimize the resulting architecture.
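As a rough illustration of what an evolutionary search over layer orders can look like, here is a minimal sketch. It is not the authors' implementation: the function names, population sizes, mutation rate, and especially `toy_score` are assumptions made for this example. In the paper, candidates are evaluated with the help of a supernet; here a toy scoring function stands in for that step so the code runs on its own.

```python
import random

# The three layer types named in the abstract.
LAYER_TYPES = ("attention", "feed_forward", "convolution")


def random_architecture(num_layers, rng):
    """Sample a layer order uniformly at random."""
    return tuple(rng.choice(LAYER_TYPES) for _ in range(num_layers))


def mutate(arch, rng, rate=0.1):
    """Resample each position from LAYER_TYPES with probability `rate`."""
    return tuple(
        rng.choice(LAYER_TYPES) if rng.random() < rate else layer
        for layer in arch
    )


def crossover(a, b, rng):
    """Single-point crossover: prefix of `a` joined to suffix of `b`."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def toy_score(arch):
    """Hypothetical stand-in for a real evaluation (assumption).

    Counts adjacent layer pairs whose types differ. A real search
    would score each candidate on a validation task instead.
    """
    return sum(x != y for x, y in zip(arch, arch[1:]))


def evolutionary_search(score_fn, num_layers=12, population_size=20,
                        generations=10, top_k=5, seed=0):
    """Keep the top_k candidates each generation and refill with children."""
    rng = random.Random(seed)
    population = [random_architecture(num_layers, rng)
                  for _ in range(population_size)]
    for _ in range(generations):
        parents = sorted(population, key=score_fn, reverse=True)[:top_k]
        children = []
        while len(children) < population_size - top_k:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        # Parents survive unchanged, so the best score never decreases.
        population = parents + children
    return max(population, key=score_fn)


best = evolutionary_search(toy_score)
print(best, toy_score(best))
```

Swapping `toy_score` for a function that evaluates a candidate on held-out data is the only change needed to make the loop meaningful; the rest of the sketch is independent of how a candidate is scored.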
Findings
LV-BERT outperforms BERT and its variants on downstream tasks.
LV-BERT-small achieves 79.8 on GLUE, surpassing ELECTRA-small.
Layer variety benefits pre-trained models.
Abstract
Modern pre-trained language models are mostly built upon backbones stacking self-attention and feed-forward layers in an interleaved order. In this paper, beyond this stereotyped layer pattern, we aim to improve pre-trained models by exploiting layer variety from two aspects: the layer type set and the layer order. Specifically, besides the original self-attention and feed-forward layers, we introduce convolution into the layer type set, which is experimentally found beneficial to pre-trained models. Furthermore, beyond the original interleaved order, we explore more layer orders to discover more powerful architectures. However, the introduced layer variety leads to a large architecture space of more than billions of candidates, while training a single candidate model from scratch already requires huge computation cost, making it not affordable to search such a space by directly…
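To get a feel for why the architecture space grows so quickly, the count can be worked out directly: if every position in the stack may hold any of the three layer types, a stack of n layers admits 3^n orders. The sketch below is only arithmetic under that assumption; the depth of 24 is an illustrative value chosen here, not a figure taken from the text above.

```python
from itertools import product

# The three layer types named in the abstract.
LAYER_TYPES = ("attention", "feed_forward", "convolution")


def search_space_size(num_types, num_layers):
    """Number of layer orders when each position is chosen independently."""
    return num_types ** num_layers


# Sanity check of the formula by brute-force enumeration at a small depth.
small_depth = 4
enumerated = sum(1 for _ in product(LAYER_TYPES, repeat=small_depth))
assert enumerated == search_space_size(len(LAYER_TYPES), small_depth)

# Illustrative depth (assumption): 24 layers.
print(search_space_size(len(LAYER_TYPES), 24))  # 282429536481
```

At this assumed depth the count is already in the hundreds of billions, which is consistent with the abstract's point that training each candidate from scratch is not affordable.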
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Linear Layer · Weight Decay · Adam · Dropout · WordPiece · Layer Normalization · Multi-Head Attention · Linear Warmup With Linear Decay
