Learning to Skip the Middle Layers of Transformers
Tim Lawson, Laurence Aitchison

TL;DR
This paper introduces a mechanism that dynamically skips a variable number of middle layers in Transformers, conditioned on the input. The design is guided by interpretability findings that the middle layers are the most redundant, and the goal is to improve efficiency.
Contribution
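It proposes a novel architecture in which a learned gating mechanism bypasses a symmetric span of central blocks, and a gated attention mechanism prevents later tokens from attending to skipped token positions. The design is motivated by interpretability findings that middle layers are more redundant and that early layers aggregate information into token positions.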
Findings
No improvement in the trade-off between validation cross-entropy and estimated FLOPs compared to dense baselines with fewer layers.
The approach aims to reduce compute requirements for 'simpler' tokens.
Code released for reproducibility and further research.
Abstract
Conditional computation is a popular strategy to make Transformers more efficient. Existing methods often target individual modules (e.g., mixture-of-experts layers) or skip layers independently of one another. However, interpretability research has demonstrated that the middle layers of Transformers exhibit greater redundancy, and that early layers aggregate information into token positions. Guided by these insights, we propose a novel architecture that dynamically skips a variable number of layers from the middle outward. In particular, a learned gating mechanism determines whether to bypass a symmetric span of central blocks based on the input, and a gated attention mechanism prevents subsequent tokens from attending to skipped token positions. Residual norms are controlled with a 'sandwich' or 'perilayernorm' scheme and gate sparsity with an adaptive regularization loss. We had…
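The skipping pattern described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`skipped_blocks`, `active_blocks`, `attention_mask`) are hypothetical, the per-token skip width is supplied directly as an integer rather than produced by a learned gate, and the sketch assumes an even number of blocks and hard (on/off) skipping.

```python
def skipped_blocks(num_layers: int, half_width: int) -> range:
    """Block indices in a symmetric span around the middle of the stack.

    half_width is a hypothetical stand-in for the output of a learned gate:
    0 skips nothing, num_layers // 2 skips every block.
    """
    if num_layers % 2 != 0:
        raise ValueError("this sketch assumes an even number of blocks")
    if not 0 <= half_width <= num_layers // 2:
        raise ValueError("half_width must lie in [0, num_layers // 2]")
    mid = num_layers // 2
    return range(mid - half_width, mid + half_width)


def active_blocks(num_layers: int, half_widths: list[int]) -> list[list[bool]]:
    """active[b][t] is True when token t is processed by block b."""
    skipped = [set(skipped_blocks(num_layers, w)) for w in half_widths]
    return [[b not in s for s in skipped] for b in range(num_layers)]


def attention_mask(active_row: list[bool]) -> list[list[bool]]:
    """Causal mask for one block.

    Query i may attend to key j only if j <= i and both positions are
    active in this block, so skipped positions are never attended to.
    """
    n = len(active_row)
    return [
        [j <= i and active_row[i] and active_row[j] for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Three tokens in a 6-block model: skip nothing, the 2 central blocks, everything.
    active = active_blocks(6, [0, 1, 3])
    for block, row in enumerate(active):
        print(block, row)
```

In this sketch, a token that stays active in a block can always attend to itself, so no active query is left without a permitted key. Rows for skipped queries are entirely masked because no output is computed for them in that block.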
Code & Models
- 🤗 tim-lawson/skip-middle-fineweb-baseline-12-layers (model, 3 downloads)
- 🤗 tim-lawson/skip-middle-fineweb-baseline-4-layers (model, 3 downloads)
- 🤗 tim-lawson/skip-middle-fineweb-baseline-8-layers (model, 1 download)
- 🤗 tim-lawson/skip-middle-fineweb-baseline-2-layers (model, 4 downloads)
- 🤗 tim-lawson/skip-middle-fineweb-baseline-6-layers (model, 3 downloads)
- 🤗 tim-lawson/skip-middle-fineweb-baseline-10-layers (model, 3 downloads)
- 🤗 tim-lawson/skip-middle-fineweb-nocontrol-8-layers (model, 5 downloads)
- 🤗 tim-lawson/skip-middle-fineweb-nocontrol-10-layers (model, 1 download)
- 🤗 tim-lawson/skip-middle-fineweb-nocontrol-2-layers (model, 1 download)
- 🤗 tim-lawson/skip-middle-fineweb-nocontrol-4-layers (model, 1 download)
Taxonomy
Topics: Art, Technology, and Culture
