Not all layers are equally as important: Every Layer Counts BERT

Lucas Georges Gabriel Charpentier; David Samuel

arXiv:2311.02265·cs.CL·November 9, 2023·2 cites

TL;DR

This paper modifies the transformer so that each layer can select which outputs of previous layers to process. The change improves data-efficient language model pretraining, and the results show that not all layers are equally important.

Contribution

It introduces a mechanism that lets each transformer layer select which previous layers' outputs it takes as input, and shows empirically that the layers are not equally important.

Findings

01

Our approach won both the strict and strict-small tracks of the BabyLM challenge.

02

Not all transformer layers are equally important.

03

Layer selection improves data efficiency.

Abstract

This paper introduces a novel modification of the transformer architecture, tailored for the data-efficient pretraining of language models. This aspect is evaluated by participating in the BabyLM challenge, where our solution won both the strict and strict-small tracks. Our approach allows each transformer layer to select which outputs of previous layers to process. The empirical results verify the potential of this simple modification and show that not all layers are equally as important.
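
The abstract describes the mechanism only at a high level, so the following is a minimal sketch of one way a layer could "select" previous outputs, not the authors' implementation. It assumes each layer holds one learnable scalar per earlier output and feeds itself the weighted sum of those outputs. The softmax normalization, the function names (`combine_previous_outputs`, `run_stack`), and the use of plain Python lists as hidden states are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_previous_outputs(previous_outputs, selection_logits):
    """Return the weighted sum of earlier outputs.

    previous_outputs: list of vectors (lists of floats), one per earlier layer,
                      with the embedding output at index 0.
    selection_logits: one scalar per earlier output; assumed to be learned.
    """
    if len(previous_outputs) != len(selection_logits):
        raise ValueError("need exactly one logit per previous output")
    weights = softmax(selection_logits)  # assumption: weights sum to 1
    dim = len(previous_outputs[0])
    return [
        sum(w * out[d] for w, out in zip(weights, previous_outputs))
        for d in range(dim)
    ]


def run_stack(embedding, layers, logits_per_layer):
    """Run a stack where layer n reads a mix of all outputs before it."""
    outputs = [embedding]
    for layer, logits in zip(layers, logits_per_layer):
        layer_input = combine_previous_outputs(outputs, logits)
        outputs.append(layer(layer_input))
    return outputs[-1]


if __name__ == "__main__":
    # Toy "layers" standing in for transformer blocks (hypothetical).
    toy_layers = [
        lambda x: [v + 1.0 for v in x],
        lambda x: [2.0 * v for v in x],
    ]
    # Layer 1 sees only the embedding; layer 2 weighs embedding and layer 1 equally.
    print(run_stack([1.0, 2.0], toy_layers, [[0.0], [0.0, 0.0]]))
```

Under these assumptions, inspecting the learned weights after training would indicate which earlier layers a given layer relies on, which is one way to read the claim that not all layers are equally important.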

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis