ConfLayers: Adaptive Confidence-based Layer Skipping for Self-Speculative Decoding
Walaa Amer, Uday Das, Fadi Kurdahi

TL;DR
ConfLayers introduces a confidence-based, adaptive layer skipping method for self-speculative decoding in large language models, achieving faster inference without quality loss.
Contribution
It presents a plug-and-play, confidence-driven approach to dynamically select layers to skip, eliminating the need for training a layer skipping policy.
Findings
Achieves up to 1.4x speedup over standard LLM generation.
Provides more consistent speed-quality trade-offs.
Adapts effectively to diverse tasks and datasets.
Abstract
Self-speculative decoding is an inference technique for large language models designed to speed up generation without sacrificing output quality. It combines fast, approximate decoding using a compact version of the model as a draft model with selective re-evaluation by the full target model. Some existing methods form the draft model by dynamically learning which layers to skip during inference, effectively creating a smaller subnetwork to speed up computation. However, using heuristic-based approaches to select layers to skip can often be simpler and more effective. In this paper, we propose ConfLayers, a dynamic plug-and-play approach to forming the draft model in self-speculative decoding via confidence-based intermediate layer skipping. The process iteratively computes confidence scores for all layers, selects layers to skip based on an adaptive threshold, evaluates the performance…
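The iterative loop the abstract describes (score layers, select layers to skip against an adaptive threshold, evaluate) can be illustrated with a minimal sketch. The abstract is truncated, so everything concrete here is an assumption made for illustration and not the authors' procedure: the function names `select_skip_layers` and `adapt_skip_set`, the callables `score_layers` and `evaluate`, the rule "skip layers whose confidence meets the threshold", and the threshold update (lower it after an improvement, raise it otherwise) are all hypothetical.

```python
from typing import Callable, Sequence, Set, Tuple


def select_skip_layers(confidences: Sequence[float], threshold: float) -> Set[int]:
    """Return indices of layers whose confidence meets the threshold (assumed rule)."""
    return {i for i, c in enumerate(confidences) if c >= threshold}


def adapt_skip_set(
    score_layers: Callable[[], Sequence[float]],  # hypothetical: per-layer confidence
    evaluate: Callable[[Set[int]], float],        # hypothetical: higher is better
    threshold: float = 0.9,
    step: float = 0.1,
    rounds: int = 4,
) -> Tuple[Set[int], float]:
    """Iteratively pick a skip set, keeping the best one seen so far."""
    best_skip: Set[int] = set()
    best_score = evaluate(best_skip)  # baseline: skip nothing
    for _ in range(rounds):
        confidences = score_layers()              # recompute scores each round
        skip = select_skip_layers(confidences, threshold)
        score = evaluate(skip)
        if score > best_score:
            best_skip, best_score = skip, score
            threshold = max(0.0, threshold - step)  # assumed: skip more aggressively
        else:
            threshold = min(1.0, threshold + step)  # assumed: back off
    return best_skip, best_score


# Toy stand-ins (not from the paper): fixed scores for an 8-layer model, and an
# evaluator that rewards skipped layers unless a "critical" layer is skipped.
toy_confidences = [0.2, 0.95, 0.72, 0.85, 0.3, 0.92, 0.6, 0.1]
critical_layers = {2}


def toy_evaluate(skip: Set[int]) -> float:
    return 0.0 if skip & critical_layers else float(len(skip))


best_skip, best_score = adapt_skip_set(lambda: toy_confidences, toy_evaluate)
print(sorted(best_skip), best_score)  # prints: [1, 3, 5] 3.0
```

In this toy run the threshold drops from 0.9 toward 0.7, the skip set grows until it includes the critical layer 2, the evaluator penalizes that choice, and the loop keeps the earlier set. In a real system `evaluate` would presumably measure something like draft acceptance rate or end-to-end throughput; the sketch leaves that choice open.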
