Accelerating Large Language Model Inference with Self-Supervised Early Exits
Florian Valade

TL;DR
This paper introduces a modular early exit strategy for large language models, using self-supervised training of intermediate heads and confidence metrics to reduce inference costs while maintaining accuracy.
Contribution
It proposes a novel self-supervised training method for early exit heads and adapts the approach to dynamic speculative decoding, improving efficiency.
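One way to picture the self-supervised objective is as a cross-entropy loss whose label comes from the main model rather than from annotated data. The sketch below assumes hard pseudo-labels (the main model's argmax token); the paper may instead use soft targets or another distillation loss, and the names `log_softmax` and `head_loss` are illustrative, not taken from the paper.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def head_loss(head_logits, main_logits):
    # Assumption: the training label is the main model's top-1 token,
    # so no human-annotated data is needed.
    target = max(range(len(main_logits)), key=main_logits.__getitem__)
    # Cross-entropy of the early exit head against that pseudo-label.
    return -log_softmax(head_logits)[target]
```

Under this reading, a head that agrees with the main model gets a low loss and a head that disagrees gets a high one, regardless of what the ground-truth next token is.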
Findings
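Among the confidence metrics evaluated, entropy gives the most reliable separation between correct and incorrect predictions.
On the Pythia model suite (70M to 2.8B parameters), inference cost drops significantly while accuracy is maintained across multiple benchmarks.
DSSD achieves 1.66x higher token acceptance than manually tuned LayerSkip baselines, with minimal hyperparameter tuning.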
Abstract
This paper presents a modular approach to accelerate inference in large language models (LLMs) by adding early exit heads at intermediate transformer layers. Each head is trained in a self-supervised manner to mimic the main model's predictions, allowing computation to stop early when a calibrated confidence threshold is reached. We evaluate several confidence metrics and show that entropy provides the most reliable separation between correct and incorrect predictions. Experiments on the Pythia model suite (70M to 2.8B parameters) demonstrate that our method significantly reduces inference cost while maintaining accuracy across multiple benchmarks. We further adapt this approach to speculative decoding, introducing Dynamic Self-Speculative Decoding (DSSD), which achieves 1.66x higher token acceptance than manually-tuned LayerSkip baselines with minimal hyperparameter tuning.
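The entropy-gated exit rule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the logits of each early exit head are already available as plain lists, that each head has its own calibrated threshold, and that a head exits when its predictive entropy is at or below that threshold. The function names and the threshold values in the usage note are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; low entropy means a confident prediction.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def early_exit_predict(head_logits_per_layer, final_logits, thresholds):
    """Return (token_id, exit_index); exit_index is None if no head exited."""
    # In a real model each head would be evaluated lazily, layer by layer,
    # so that later layers are skipped; logits are precomputed here only
    # to keep the sketch self-contained.
    for i, (logits, tau) in enumerate(zip(head_logits_per_layer, thresholds)):
        probs = softmax(logits)
        if entropy(probs) <= tau:
            return probs.index(max(probs)), i
    # No head was confident enough: fall back to the main model's output.
    probs = softmax(final_logits)
    return probs.index(max(probs)), None
```

For example, with a threshold of 0.5 nats per head, a first head that outputs uniform logits `[0, 0, 0]` does not exit, while a second head that outputs `[0, 8, 0]` does, returning token 1 at exit index 1.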
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
