BERT's output layer recognizes all hidden layers? Some Intriguing Phenomena and a simple way to boost BERT

Wei-Tsung Kao; Tsung-Han Wu; Po-Han Chi; Chun-Cheng Hsieh; Hung-Yi Lee

arXiv:2001.09309 · cs.CL · February 16, 2021 · 6 cites

Open Access

TL;DR

This paper shows that BERT's output layer can reconstruct the input sentence from every hidden layer, not only the final one, and proposes a simple layer-duplication method to improve BERT's downstream task performance; the duplication step itself requires no extra training.

Contribution

It uncovers a surprising property of BERT's output layer and introduces a straightforward way to improve BERT's downstream performance by duplicating layers, a step that itself requires no extra training.

Findings

01. BERT's output layer can reconstruct the input sentence from each hidden layer, not only the final one

02. Duplicating some layers to make BERT deeper improves downstream task performance

03. The duplication step itself requires no extra training
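The layer duplication in findings 02 and 03 can be pictured with a minimal Python sketch. It is not the authors' implementation: the toy "layers" are plain dictionaries standing in for encoder blocks, `duplicate_layers` is a hypothetical helper, and the choice of which positions to duplicate is an assumption made only for illustration.

```python
import copy


def duplicate_layers(layers, indices):
    """Return a deeper stack in which every layer whose position is listed in
    `indices` appears twice in a row; the second occurrence is a deep copy."""
    wanted = set(indices)
    deeper = []
    for position, layer in enumerate(layers):
        deeper.append(layer)
        if position in wanted:
            # The copy starts with identical weights, so no training is
            # needed to build the deeper stack.
            deeper.append(copy.deepcopy(layer))
    return deeper


# Toy stand-in for a 4-layer encoder (assumption: real layers hold tensors).
stack = [{"name": f"layer_{i}", "weight": [float(i)]} for i in range(4)]

# Assumption: duplicating positions 1 and 2 is an arbitrary example choice.
deeper = duplicate_layers(stack, indices=[1, 2])
print([layer["name"] for layer in deeper])
# ['layer_0', 'layer_1', 'layer_1', 'layer_2', 'layer_2', 'layer_3']
```

Deep copies are used here so that each duplicate can later be updated independently of its source layer; whether the paper shares or copies parameters is not stated in this summary.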

Abstract

Although Bidirectional Encoder Representations from Transformers (BERT) has achieved tremendous success in many natural language processing (NLP) tasks, it remains a black box. A variety of previous works have tried to lift the veil of BERT and understand each layer's functionality. In this paper, we find that, surprisingly, the output layer of BERT can reconstruct the input sentence when it is fed any hidden layer of BERT directly as input, even though the output layer has never seen any input other than the final hidden layer. This holds across a wide variety of BERT-based models, even when some layers are duplicated. Based on this observation, we propose a simple method to boost the performance of BERT. By duplicating some layers in BERT-based models to make them deeper (no extra training is required in this step), the models obtain better performance in the downstream tasks after…
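The reconstruction probe described in the abstract amounts to applying one fixed output layer to the hidden states of every layer and checking how many input tokens it recovers. The sketch below illustrates that procedure on toy data only: the three-word vocabulary, the hidden vectors, and the helper names `output_head` and `reconstruction_accuracy` are all assumptions, and the resulting numbers are not results from the paper.

```python
def output_head(hidden, vocab_matrix):
    """Toy output layer: score each vocabulary entry by a dot product with the
    hidden vector and return the id of the highest-scoring entry."""
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in vocab_matrix]
    return max(range(len(scores)), key=scores.__getitem__)


def reconstruction_accuracy(hidden_states_per_layer, input_ids, vocab_matrix):
    """Apply the same output head to every layer's hidden states and return,
    per layer, the fraction of input tokens that are recovered."""
    accuracies = []
    for layer_states in hidden_states_per_layer:
        predictions = [output_head(h, vocab_matrix) for h in layer_states]
        correct = sum(p == t for p, t in zip(predictions, input_ids))
        accuracies.append(correct / len(input_ids))
    return accuracies


# Made-up toy data (assumption): 3-word vocabulary, 3-token input, 2 layers.
vocab_matrix = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
input_ids = [0, 2, 1]
hidden_states_per_layer = [
    [[0.9, 0.1, 0.0], [0.2, 0.1, 0.7], [0.1, 0.8, 0.1]],  # toy "layer A"
    [[0.6, 0.3, 0.1], [0.5, 0.1, 0.4], [0.0, 1.0, 0.0]],  # toy "layer B"
]

accuracies = reconstruction_accuracy(hidden_states_per_layer, input_ids, vocab_matrix)
print(accuracies)
```

With a real model, the hidden states would come from each encoder layer and the head would be the pretrained output layer; the toy version only shows the shape of the measurement.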

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Linear Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · Adam · WordPiece · Softmax