Accelerating Large Language Model Inference with Self-Supervised Early Exits

Florian Valade

arXiv:2407.21082·cs.CL·February 13, 2026

TL;DR

This paper introduces a modular early exit strategy for large language models, using self-supervised training of intermediate heads and confidence metrics to reduce inference costs while maintaining accuracy.

Contribution

It proposes a novel self-supervised training method for early exit heads and adapts the approach to speculative decoding, introducing Dynamic Self-Speculative Decoding (DSSD).

Findings

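01. Among the confidence metrics evaluated, entropy gives the most reliable separation between correct and incorrect predictions.

02. On the Pythia model suite (70M to 2.8B parameters), inference cost drops significantly while accuracy is maintained across multiple benchmarks.

03. DSSD achieves 1.66x higher token acceptance than manually tuned LayerSkip baselines.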

Abstract

This paper presents a modular approach to accelerate inference in large language models (LLMs) by adding early exit heads at intermediate transformer layers. Each head is trained in a self-supervised manner to mimic the main model's predictions, allowing computation to stop early when a calibrated confidence threshold is reached. We evaluate several confidence metrics and show that entropy provides the most reliable separation between correct and incorrect predictions. Experiments on the Pythia model suite (70M to 2.8B parameters) demonstrate that our method significantly reduces inference cost while maintaining accuracy across multiple benchmarks. We further adapt this approach to speculative decoding, introducing Dynamic Self-Speculative Decoding (DSSD), which achieves 1.66x higher token acceptance than manually tuned LayerSkip baselines with minimal hyperparameter tuning.
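The exit rule described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the example logits, and the threshold values are all hypothetical, and the sketch assumes that a head exits when the entropy of its softmax distribution falls at or below a per-head threshold. The paper calibrates its thresholds; that calibration step is not shown here.

```python
import math
from typing import List, Optional, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; low entropy means a peaked, confident distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def early_exit_predict(
    head_logits: Sequence[Tuple[int, Sequence[float]]],
    final_logits: Sequence[float],
    thresholds: Sequence[float],
) -> Tuple[int, Optional[int]]:
    """Return (predicted_token, exit_layer); exit_layer is None if no head exits.

    head_logits holds (layer_index, logits) pairs in depth order. For simplicity
    they are precomputed here; a real model would stop the forward pass at the
    exiting layer, which is where the compute saving comes from.
    thresholds are hypothetical per-head entropy cutoffs (assumed calibrated elsewhere).
    """
    for (layer, logits), tau in zip(head_logits, thresholds):
        if entropy(softmax(logits)) <= tau:
            token = max(range(len(logits)), key=lambda i: logits[i])
            return token, layer
    # No head was confident enough: fall back to the full model's output.
    token = max(range(len(final_logits)), key=lambda i: final_logits[i])
    return token, None


# Hypothetical example: the head at layer 2 is near-uniform (high entropy),
# the head at layer 4 is sharply peaked (low entropy).
heads = [(2, [0.1, 0.0, 0.05]), (4, [8.0, 0.0, 0.0])]
final = [9.0, 0.0, 0.0]
result = early_exit_predict(heads, final, thresholds=[0.3, 0.3])
```

With these made-up values the layer-2 head is skipped and the layer-4 head exits. Lowering the thresholds makes exits rarer and pushes more tokens through the full model, which is the cost and accuracy trade-off the thresholds control.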

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis