Dynamic layer selection in decoder-only transformers

Theodore Glavas; Joud Chataoui; Florence Regol; Wassim Jabbour; Antonios Valkanas; Boris N. Oreshkin; Mark Coates

arXiv:2410.20022·cs.CL·October 29, 2024

TL;DR

This paper investigates dynamic inference methods for decoder-only transformers, revealing that layer skipping is more robust than early exiting and demonstrating potential for significant efficiency gains with optimized layer allocation.

Contribution

It provides an empirical comparison of layer skipping and early exiting, and introduces an oracle controller for dynamic layer allocation in decoder-only models.

Findings

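01. A pre-trained decoder-only model is significantly more robust to layer removal via layer skipping than via early exit.

02. A per-sequence layer allocation exists that matches full-model performance using only 23.3% of the layers.

03. An oracle controller shows that per-sequence dynamic allocation holds promise for significant efficiency gains.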

Abstract

The vast size of Large Language Models (LLMs) has prompted a search to optimize inference. One effective approach is dynamic inference, which adapts the architecture to the sample-at-hand to reduce the overall computational cost. We empirically examine two common dynamic inference methods for natural language generation (NLG): layer skipping and early exiting. We find that a pre-trained decoder-only model is significantly more robust to layer removal via layer skipping, as opposed to early exit. We demonstrate the difficulty of using hidden state information to adapt computation on a per-token basis for layer skipping. Finally, we show that dynamic computation allocation on a per-sequence basis holds promise for significant efficiency gains by constructing an oracle controller. Remarkably, we find that there exists an allocation which achieves equal performance to the full model using…
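To make the contrast between the two methods concrete, here is a minimal sketch in plain Python. It is not taken from the paper or its repository: the toy layers, the function names, and the evenly spaced skipping policy are all assumptions chosen for illustration. Under a fixed budget of layers, early exit keeps the first layers and drops the rest, while layer skipping keeps a subset spread across the full depth.

```python
from typing import Callable, List, Sequence

Layer = Callable[[List[float]], List[float]]


def make_toy_layer(scale: float) -> Layer:
    """Hypothetical stand-in for a transformer block: a residual update h + scale * h."""
    def layer(h: List[float]) -> List[float]:
        return [x + scale * x for x in h]
    return layer


def early_exit_indices(num_layers: int, budget: int) -> List[int]:
    """Early exit: run the first `budget` layers, then stop."""
    return list(range(min(budget, num_layers)))


def uniform_skip_indices(num_layers: int, budget: int) -> List[int]:
    """Layer skipping under an assumed policy: keep `budget` layers spread evenly over depth."""
    budget = min(budget, num_layers)
    if budget <= 0:
        return []
    if budget == 1:
        return [0]
    # Integer arithmetic keeps the indices distinct and always includes the first and last layer.
    return [(i * (num_layers - 1)) // (budget - 1) for i in range(budget)]


def run_layers(h: List[float], layers: Sequence[Layer], keep: Sequence[int]) -> List[float]:
    """Apply only the layers whose index is in `keep`, preserving their original order."""
    keep_set = set(keep)
    for i, layer in enumerate(layers):
        if i in keep_set:
            h = layer(h)
    return h


if __name__ == "__main__":
    layers = [make_toy_layer(0.1) for _ in range(8)]
    print("early exit keeps:   ", early_exit_indices(len(layers), 4))
    print("layer skipping keeps:", uniform_skip_indices(len(layers), 4))
    print(run_layers([1.0], layers, uniform_skip_indices(len(layers), 4)))
```

Both selections cost the same number of layer evaluations; they differ only in which layers run. A per-sequence controller of the kind the abstract describes would choose the budget for each input sequence instead of fixing it in advance.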

Code & Models

Repositories

networkslab/enlsp_neurips24
PyTorch · Official

Taxonomy

Topics: Induction Heating and Inverter Technology · Advanced Data Compression Techniques · Neural Networks and Applications