
TL;DR
This paper introduces a method to detect unactivated layers, called Voids, in transformer-based language models during inference, revealing that selectively skipping layers can improve task performance.
Contribution
The paper adapts L2 Adaptive Computation (LAC), a non-trainable and parameter-free method, to identify Voids in LMs, showing that many layers are inactive during inference and that skipping them can improve model accuracy.
Findings
Skipping Voids improves model performance on benchmarks.
Different layers activate during prompt processing and response generation.
Selective layer skipping reduces computational load while maintaining accuracy.
Abstract
Despite advances in transformer-based language models (LMs), a fundamental question remains largely unanswered: Are all layers activated during inference? We investigate this question by detecting unactivated layers (which we refer to as Voids) using a non-trainable and parameter-free adaptive computation method called L2 Adaptive Computation (LAC). We adapt LAC from its original efficiency-focused application to trace activated layers during inference. This method monitors changes in the L2-norm of activations to identify voids. We analyze layer activation in instruction-tuned LMs across two phases: Prompt Processing (PP), where we trace activated layers for each token in the input prompts, and Response Generation (RG), where we trace activated layers for each generated token. We further demonstrate that distinct layers are activated during these two phases. To show the effectiveness…
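The abstract says only that LAC monitors changes in the L2-norm of activations to identify voids; it does not give the exact criterion. The sketch below is one plausible reading, not the authors' implementation: a layer is flagged as a void for a token when the relative change in L2-norm across that layer falls below a threshold. The function name `detect_voids`, the relative-change rule, and the default `threshold=0.05` are all assumptions made for illustration.

```python
import numpy as np

def detect_voids(hidden_states, threshold=0.05):
    """Flag layers whose output L2-norm barely differs from their input norm.

    hidden_states: array of shape (num_layers + 1, hidden_dim) for ONE token.
        Row 0 is the input to the first layer; row i is the output of layer i.
    threshold: hypothetical cutoff on relative norm change (an assumption,
        not a value taken from the paper).

    Returns a boolean array of length num_layers; True marks a candidate void.
    """
    hidden_states = np.asarray(hidden_states, dtype=np.float64)
    norms = np.linalg.norm(hidden_states, axis=-1)
    # Relative change in L2-norm across each layer (assumed criterion).
    rel_change = np.abs(norms[1:] - norms[:-1]) / (norms[:-1] + 1e-12)
    return rel_change < threshold

# Toy example: three layers acting on a 2-dimensional hidden state.
toy_states = np.array([
    [3.0, 4.0],   # input to layer 1, norm 5
    [3.0, 4.0],   # layer 1 output: norm unchanged
    [6.0, 8.0],   # layer 2 output: norm doubles
    [6.0, 8.1],   # layer 3 output: norm changes by under 1%
])
void_mask = detect_voids(toy_states)
```

In this toy example, layers 1 and 3 are flagged and layer 2 is not. In the setting the abstract describes, such a mask would be computed per token, separately for prompt tokens (PP) and generated tokens (RG). Note that a norm-only test of this kind ignores changes in direction that leave the norm unchanged, so it is a coarse proxy at best.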
Taxonomy
Topics: Natural Language Processing Techniques
