An Efficient Inference Framework for Early-exit Large Language Models

Ruijie Miao; Yihan Yan; Xinshuo Yao; Tong Yang

arXiv:2407.20272·cs.CL·July 31, 2024

Open Access

TL;DR

This paper introduces an efficient inference framework tailored to early-exit large language models, addressing challenges specific to these models and achieving up to 1.25x speedup over full-layer inference.

Contribution

The paper proposes batch inference and KV cache management techniques designed specifically for early-exit LLMs, filling a gap left by existing inference frameworks.

Findings

01 Achieves up to 1.25x speedup compared to full-layer inference.

02 Processes each batch until all sequences surpass the early-exit confidence threshold.

03 Develops a KV cache management method for early-exit models.

Abstract

Building efficient inference frameworks has gained increasing interest in the research community. Early-exit models, a variant of LLMs, improve inference efficiency by skipping the remaining layers and directly generating output tokens when they are confident enough. However, no existing LLM inference framework takes early-exit models into consideration. Supporting them is non-trivial, as prior art on LLM inference cannot be directly applied to early-exit models. In this work, we solve two key challenges in building an efficient inference framework for early-exit models: (1) batch inference at iteration-level granularity; and (2) KV cache management. For the former, we propose to process the batch until all sequences surpass the early-exit confidence threshold. For the latter, we propose to fill the KV cache of the remaining layers before the iteration terminates. Our evaluation shows that,…
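The two ideas in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the class `ToyEarlyExitModel`, the function `decode_iteration`, the confidence formula, and the choice to fill skipped layers from the exit-layer hidden state are all assumptions made for illustration. The sketch only shows the control flow: a batch keeps running layers until every sequence surpasses the confidence threshold, and the KV cache of the skipped layers is filled before the iteration ends.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEarlyExitModel:
    """Hypothetical stand-in for an early-exit LLM (not the paper's code)."""
    num_layers: int
    # kv_cache[layer][seq_id] holds one entry per decoded position.
    kv_cache: list = field(init=False)

    def __post_init__(self):
        self.kv_cache = [dict() for _ in range(self.num_layers)]

    def run_layer(self, layer, hidden):
        # Toy update; a real layer would apply attention and an MLP.
        return hidden + 1.0

    def confidence(self, layer, difficulty):
        # Toy confidence that grows with depth; higher difficulty delays exit.
        return min(1.0, (layer + 1) / difficulty)

    def kv_entry(self, layer, hidden):
        # Toy key/value projection of a hidden state at a given layer.
        return (layer, hidden)


def decode_iteration(model, hiddens, difficulties, threshold):
    """Run one decoding iteration for a batch and return the exit layer."""
    seq_ids = list(hiddens)
    exit_layer = model.num_layers - 1
    for layer in range(model.num_layers):
        for s in seq_ids:
            # Cache K/V from the layer input, then compute the layer output.
            model.kv_cache[layer].setdefault(s, []).append(
                model.kv_entry(layer, hiddens[s]))
            hiddens[s] = model.run_layer(layer, hiddens[s])
        # The batch exits only when ALL sequences surpass the threshold.
        if all(model.confidence(layer, difficulties[s]) >= threshold
               for s in seq_ids):
            exit_layer = layer
            break
    # Fill the KV cache of the skipped layers before the iteration ends.
    # Assumption: reuse the exit-layer hidden state as each layer's input.
    for layer in range(exit_layer + 1, model.num_layers):
        for s in seq_ids:
            model.kv_cache[layer].setdefault(s, []).append(
                model.kv_entry(layer, hiddens[s]))
    return exit_layer


model = ToyEarlyExitModel(num_layers=8)
hiddens = {"a": 0.0, "b": 0.0}
difficulties = {"a": 2.0, "b": 5.0}  # "b" needs more layers than "a"
exit_layer = decode_iteration(model, hiddens, difficulties, threshold=0.75)
print("batch exited after layer index", exit_layer)
```

In this toy setup the easier sequence is confident well before the harder one, but the batch continues until both pass the threshold. Because every layer receives a cache entry for the new position, a later iteration that runs deeper still finds a complete KV cache for all earlier tokens.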

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings