BERT's output layer recognizes all hidden layers? Some Intriguing Phenomena and a simple way to boost BERT
Wei-Tsung Kao, Tsung-Han Wu, Po-Han Chi, Chun-Cheng Hsieh, Hung-Yi Lee

TL;DR
This paper reveals that BERT's output layer can reconstruct input sentences from all hidden layers, and proposes a simple layer duplication method to enhance BERT's downstream task performance without additional training.
Contribution
It uncovers a surprising property of BERT's output layer and introduces a straightforward, training-free method to improve BERT's effectiveness by duplicating layers.
Findings
BERT's output layer can reconstruct input sentences from hidden layers
Duplicating layers in BERT improves downstream task performance
The method requires no additional training after duplication
Abstract
Although Bidirectional Encoder Representations from Transformers (BERT) have achieved tremendous success in many natural language processing (NLP) tasks, it remains a black box. A variety of previous works have tried to lift the veil of BERT and understand each layer's functionality. In this paper, we found that surprisingly the output layer of BERT can reconstruct the input sentence by directly taking each layer of BERT as input, even though the output layer has never seen the input other than the final hidden layer. This fact remains true across a wide variety of BERT-based models, even when some layers are duplicated. Based on this observation, we propose a quite simple method to boost the performance of BERT. By duplicating some layers in the BERT-based models to make it deeper (no extra training required in this step), they obtain better performance in the downstream tasks after…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
MethodsLinear Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Refunds@Expedia|||How do I get a full refund from Expedia? · Dense Connections · Adam · WordPiece · Softmax
