Not all layers are equally as important: Every Layer Counts BERT
Lucas Georges Gabriel Charpentier, David Samuel

TL;DR
This paper presents a transformer modification that lets each layer select which outputs of previous layers to process. The results show that not all layers are equally important and that the modification benefits data-efficient language model pretraining.
Contribution
It introduces a layer-wise selection mechanism for transformers, in which each layer chooses which earlier layer outputs to process, and shows that layers contribute unequally to model performance.
Findings
The approach won both the strict and strict-small tracks of the BabyLM challenge.
Not all transformer layers are equally important.
Layer selection improves data efficiency.
Abstract
This paper introduces a novel modification of the transformer architecture, tailored for the data-efficient pretraining of language models. This aspect is evaluated by participating in the BabyLM challenge, where our solution won both the strict and strict-small tracks. Our approach allows each transformer layer to select which outputs of previous layers to process. The empirical results verify the potential of this simple modification and show that not all layers are equally as important.
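The selection mechanism described in the abstract can be illustrated with a minimal sketch. It assumes that "selecting" means each layer takes a weighted sum of all earlier outputs (the embedding plus every preceding layer), with one scalar weight per earlier output; the abstract does not specify the exact form, so this is only one plausible reading. The names `combine_previous`, `forward`, and `layer_weights` are hypothetical, and the toy "layers" are plain functions rather than real transformer blocks.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def combine_previous(outputs: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Weighted sum of all earlier outputs, one scalar weight per output.

    Assumption: this weighted sum stands in for "selecting which outputs
    of previous layers to process"; a weight of 0 ignores that output.
    """
    if len(outputs) != len(weights):
        raise ValueError("need exactly one weight per previous output")
    dim = len(outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, outputs)) for i in range(dim)]


def forward(
    embedding: Vector,
    layers: Sequence[Callable[[Vector], Vector]],
    layer_weights: Sequence[Sequence[float]],
) -> Vector:
    """Run toy layers where each layer reads a mix of every earlier output."""
    outputs = [embedding]  # index 0 holds the embedding output
    for layer, weights in zip(layers, layer_weights):
        layer_input = combine_previous(outputs, weights)
        outputs.append(layer(layer_input))
    return outputs[-1]


# Toy stand-ins for transformer layers (hypothetical, for illustration only).
def double(x: Vector) -> Vector:
    return [2 * v for v in x]


def add_one(x: Vector) -> Vector:
    return [v + 1 for v in x]


# Layer 2 mixes the embedding and layer 1's output equally.
mixed = forward([1.0, 2.0], [double, add_one], [[1.0], [0.5, 0.5]])

# Layer 2 reads only layer 1's output, which recovers an ordinary stack.
sequential = forward([1.0, 2.0], [double, add_one], [[1.0], [0.0, 1.0]])
```

In a trained model the weights would be learned parameters, and inspecting them afterwards is one way to see which earlier layers a given layer relies on.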
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
