Not All Layers of LLMs Are Necessary During Inference
Siqi Fan, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang, Zhongyuan Wang

TL;DR
This paper introduces AdaInfer, an adaptive method to terminate LLM inference early based on intermediate layer outputs, significantly reducing computational costs while maintaining accuracy.
Contribution
The paper presents AdaInfer, a simple algorithm that predicts, for each input, the layer at which inference can stop, reducing resource use without retraining or modifying LLMs.
Findings
Achieves up to 43% inference pruning on sentiment tasks
Maintains less than 1% performance drop across tasks
Works with popular LLMs like Llama2 and OPT
Abstract
Due to the large number of parameters, the inference phase of Large Language Models (LLMs) is resource-intensive. However, not all requests posed to LLMs are equally difficult to handle. Through analysis, we show that for some tasks, LLMs can achieve results comparable to the final output at some intermediate layers. That is, not all layers of LLMs are necessary during inference. If we can predict at which layer the inferred results match the final results (produced by evaluating all layers), we could significantly reduce the inference cost. To this end, we propose a simple yet effective algorithm named AdaInfer to adaptively terminate the inference process for an input instance. AdaInfer relies on easily obtainable statistical features and classic classifiers like SVM. Experiments on well-known LLMs like the Llama2 series and OPT show that AdaInfer can achieve an average of 17.8%…
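The early-termination idea in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the choice of features (top probability and the gap between the two largest probabilities), and the hand-set weights and bias are all hypothetical, and a fixed linear rule stands in for a trained classifier such as an SVM. Per-layer logits are consumed lazily, so layers after the stopping point are never evaluated.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stop_features(logits: Sequence[float]) -> Tuple[float, float]:
    """Assumed statistical features: top probability and top-2 probability gap."""
    probs = sorted(softmax(logits), reverse=True)
    return probs[0], probs[0] - probs[1]


def linear_stop_rule(features: Tuple[float, float],
                     weights: Tuple[float, float] = (4.0, 4.0),
                     bias: float = -5.0) -> bool:
    """Hypothetical stand-in for a trained classifier's decision function."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return score > 0.0


def adaptive_infer(
    per_layer_logits: Iterable[Sequence[float]],
    should_stop: Callable[[Tuple[float, float]], bool] = linear_stop_rule,
) -> Tuple[int, int]:
    """Return (predicted class index, number of layers evaluated)."""
    layers_used = 0
    last = None
    for logits in per_layer_logits:  # lazily pulls one layer at a time
        layers_used += 1
        last = logits
        if should_stop(stop_features(logits)):
            break  # confident enough: skip the remaining layers
    if last is None:
        raise ValueError("per_layer_logits must yield at least one layer")
    prediction = max(range(len(last)), key=lambda i: last[i])
    return prediction, layers_used


if __name__ == "__main__":
    easy = [[0.1, 0.0, 0.0], [5.0, 0.0, 0.0], [9.0, 0.0, 0.0]]
    hard = [[0.1, 0.0, 0.0], [0.2, 0.1, 0.0], [0.0, 0.3, 0.2]]
    print(adaptive_infer(easy))  # stops once the distribution is peaked
    print(adaptive_infer(hard))  # falls through to the last layer
```

In a real model, each element of `per_layer_logits` would come from projecting an intermediate hidden state through the output head, and the stopping rule would be fitted on held-out examples rather than set by hand.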
Taxonomy
Topics: Cancer-related molecular mechanisms research · Cell Adhesion Molecules Research · Cancer-related gene regulation
Methods: Support Vector Machine · OPT · Pruning
