Not All Layers of LLMs Are Necessary During Inference

Siqi Fan; Xin Jiang; Xiang Li; Xuying Meng; Peng Han; Shuo Shang; Aixin Sun; Yequan Wang; Zhongyuan Wang

arXiv:2403.02181 · cs.CL · July 10, 2024 · 1 citation

Open Access

TL;DR

This paper introduces AdaInfer, an adaptive method that terminates LLM inference early based on features of intermediate-layer outputs, reducing computational cost with little loss in accuracy.

Contribution

The paper presents AdaInfer, a simple algorithm that predicts, for each input, the layer at which inference can stop, reducing resource use without retraining the LLM or modifying its parameters.

Findings

01

Achieves a pruning ratio of up to 43% on sentiment tasks

02

Keeps the performance drop below 1% across tasks

03

Evaluated on popular LLMs such as the Llama2 series and OPT
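The findings quote pruning as a percentage without defining it on this page. One plausible reading, assumed for this sketch, is the fraction of layers skipped per input, averaged over inputs. The function name `pruning_ratio` and the example exit layers are hypothetical, not taken from the paper.

```python
from typing import Sequence


def pruning_ratio(exit_layers: Sequence[int], num_layers: int) -> float:
    """Average fraction of layers skipped (assumed definition).

    exit_layers holds the 1-based layer at which each input stopped;
    an input that runs all num_layers layers contributes 0 skipped layers.
    """
    if num_layers <= 0 or not exit_layers:
        raise ValueError("need a positive layer count and at least one exit layer")
    if any(not 1 <= e <= num_layers for e in exit_layers):
        raise ValueError("exit layers must lie in [1, num_layers]")
    skipped = sum(num_layers - e for e in exit_layers)
    return skipped / (len(exit_layers) * num_layers)


# Hypothetical exits for four inputs on a 32-layer model:
# 12 + 0 + 16 + 8 = 36 of 128 layer evaluations skipped.
print(pruning_ratio([20, 32, 16, 24], 32))  # 0.28125
```

Under this reading, an input that never exits early contributes zero savings, so a high average ratio requires many inputs to stop well before the last layer.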

Abstract

Due to their large number of parameters, the inference phase of Large Language Models (LLMs) is resource-intensive. However, not all requests posed to LLMs are equally difficult to handle. Through analysis, we show that for some tasks, LLMs can achieve results comparable to the final output at some intermediate layers. That is, not all layers of LLMs are necessary during inference. If we can predict at which layer the inferred results match the final results (produced by evaluating all layers), we could significantly reduce the inference cost. To this end, we propose a simple yet effective algorithm named AdaInfer to adaptively terminate the inference process for an input instance. AdaInfer relies on easily obtainable statistical features and classic classifiers like SVM. Experiments on well-known LLMs like the Llama2 series and OPT show that AdaInfer can achieve an average of 17.8%…
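The abstract describes AdaInfer only at a high level: cheap statistical features computed at intermediate layers feed a classic classifier such as an SVM, which decides whether inference can stop. The sketch below is a hypothetical, pure-Python illustration of that loop, not the authors' implementation. Its assumptions: each layer's hidden state has already been mapped to vocabulary logits (for example, through the model's output head); the two features (top-1 probability and its gap to the second-most-likely token) are chosen for illustration; a layer is labeled "safe to stop" when its top token already matches the final layer's top token, following the abstract's framing; and the linear SVM is fit with plain subgradient descent instead of a library solver. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def layer_features(logits: Sequence[float]) -> Tuple[float, float]:
    """Hypothetical features: top-1 probability and its gap to the top-2 probability."""
    probs = sorted(softmax(logits), reverse=True)
    return probs[0], probs[0] - probs[1]


def argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=lambda k: values[k])


def label_layers(per_layer_logits: Sequence[Sequence[float]]) -> List[int]:
    """+1 if a layer's top token already matches the final layer's, else -1."""
    final = argmax(per_layer_logits[-1])
    return [1 if argmax(lg) == final else -1 for lg in per_layer_logits]


def train_linear_svm(X: Sequence[Sequence[float]], y: Sequence[int],
                     lam: float = 0.01, lr: float = 0.1,
                     epochs: int = 500) -> Tuple[List[float], float]:
    """Linear SVM via subgradient descent on the L2-regularized hinge loss (y in {-1, +1})."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Hinge term active: move toward classifying (xi, yi) with margin 1.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Only the regularizer contributes to the subgradient.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def should_stop(w: Sequence[float], b: float, feats: Sequence[float]) -> bool:
    return sum(wj * fj for wj, fj in zip(w, feats)) + b > 0


def adaptive_exit(per_layer_logits: Sequence[Sequence[float]],
                  w: Sequence[float], b: float) -> Tuple[int, int]:
    """Return (1-based exit layer, predicted token id); falls back to the last layer."""
    n = len(per_layer_logits)
    for layer, logits in enumerate(per_layer_logits, start=1):
        if should_stop(w, b, layer_features(logits)) or layer == n:
            return layer, argmax(logits)
    raise ValueError("per_layer_logits must be non-empty")


# Toy usage: (top-1 prob, gap) features labeled +1 (stop) or -1 (continue).
X = [(0.9, 0.8), (0.85, 0.75), (0.95, 0.9), (0.3, 0.05), (0.35, 0.1), (0.4, 0.02)]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
layers = [[0, 0, 0, 0], [0.5, 0, 0, 0], [10, 0, 0, 0], [9, 0, 0, 1]]
print(adaptive_exit(layers, w, b))
```

On this toy data the flat first two layers should land on the "continue" side, so inference stops at layer 3 and the fourth layer is never evaluated. A linear decision rule keeps the per-layer check to a single dot product, which matters because the check runs after every layer.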

Taxonomy

Methods: Support Vector Machine · OPT · Pruning