LV-BERT: Exploiting Layer Variety for BERT

Weihao Yu; Zihang Jiang; Fei Chen; Qibin Hou; Jiashi Feng

arXiv:2106.11740·cs.CL·June 28, 2021

TL;DR

LV-BERT introduces layer variety by adding convolution to the layer type set and exploring new layer orders. It uses a supernet and evolutionary search to find architectures that outperform BERT and its variants.

Contribution

This paper proposes a novel approach to enhance BERT by exploiting layer type diversity and order, employing a supernet and evolutionary search for architecture optimization.

Findings

01 LV-BERT outperforms BERT and its variants on downstream tasks.

02 LV-BERT-small achieves 79.8 on GLUE, surpassing ELECTRA-small.

03 Layer variety benefits pre-trained models.

Abstract

Modern pre-trained language models are mostly built upon backbones stacking self-attention and feed-forward layers in an interleaved order. In this paper, beyond this stereotyped layer pattern, we aim to improve pre-trained models by exploiting layer variety from two aspects: the layer type set and the layer order. Specifically, besides the original self-attention and feed-forward layers, we introduce convolution into the layer type set, which is experimentally found beneficial to pre-trained models. Furthermore, beyond the original interleaved order, we explore more layer orders to discover more powerful architectures. However, the introduced layer variety leads to a large architecture space of more than billions of candidates, while training a single candidate model from scratch already requires huge computation cost, making it not affordable to search such a space by directly…
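To make the size of the search space and the idea of evolutionary search over layer orders concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the depth of 24 layers, the names `LAYER_TYPES`, `space_size`, `mutate`, `crossover`, and `evolutionary_search`, and especially `toy_fitness` are all hypothetical. In the paper's setting, a candidate would presumably be scored with the help of the supernet rather than by a hand-written function.

```python
import random

# Layer type set from the abstract: self-attention, feed-forward, convolution.
LAYER_TYPES = ("A", "F", "C")
# Assumed depth for illustration only; the abstract does not state it.
NUM_LAYERS = 24


def space_size(num_types, num_layers):
    """Number of distinct layer sequences of the given length."""
    return num_types ** num_layers


def toy_fitness(arch):
    """Hypothetical stand-in score: counts adjacent positions whose types differ.

    A real search would evaluate each candidate model instead.
    """
    return sum(1 for a, b in zip(arch, arch[1:]) if a != b)


def mutate(arch, rng, prob=0.1):
    """Resample each position with probability `prob`."""
    return tuple(rng.choice(LAYER_TYPES) if rng.random() < prob else t for t in arch)


def crossover(a, b, rng):
    """Single-point crossover of two equal-length layer sequences."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def evolutionary_search(fitness, rng, population_size=20, generations=10, top_k=5):
    """Keep the top_k candidates each generation and refill by mutation or crossover."""
    population = [
        tuple(rng.choice(LAYER_TYPES) for _ in range(NUM_LAYERS))
        for _ in range(population_size)
    ]
    for _ in range(generations):
        parents = sorted(population, key=fitness, reverse=True)[:top_k]
        children = []
        while len(children) < population_size - top_k:
            if rng.random() < 0.5:
                children.append(mutate(rng.choice(parents), rng))
            else:
                first, second = rng.sample(parents, 2)
                children.append(crossover(first, second, rng))
        population = parents + children
    return max(population, key=fitness)


if __name__ == "__main__":
    print("candidates:", space_size(len(LAYER_TYPES), NUM_LAYERS))
    best = evolutionary_search(toy_fitness, random.Random(0))
    print("best layer order:", "".join(best))
```

Under the assumed depth of 24, three layer types give 3^24 = 282,429,536,481 sequences, which is consistent with the abstract's "more than billions of candidates" and shows why scoring every candidate by training it from scratch is impractical.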

Code & Models

Repositories

yuweihao/LV-BERT · tf · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Linear Layer · Weight Decay · Adam · Dropout · WordPiece · Layer Normalization · Multi-Head Attention · Linear Warmup With Linear Decay