EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse

Tianyu Guo; Hande Dong; Yichong Leng; Feng Liu; Cheater Lin; Nong Xiao; Xianwei Zhang

arXiv:2505.21889 · cs.CL · May 30, 2025

TL;DR

EFIM introduces a transformed prompt format and fragment tokenization to improve KV cache reuse in LLM infilling tasks, reducing latency by 52% and increasing throughput by 98% while maintaining the original infilling capability.

Contribution

The paper proposes EFIM, a novel prompt transformation and fragment tokenization method that improves KV cache reuse efficiency in LLM infilling tasks.

Findings

01. Latency reduced by 52%
02. Throughput increased by 98%
03. Maintains original infilling capability

Abstract

Large language models (LLMs) are often used for infilling tasks, which involve predicting or generating missing information in a given text. These tasks typically require multiple interactions with similar context. To reduce the computation of repeated historical tokens, cross-request key-value (KV) cache reuse, a technique that stores and reuses intermediate computations, has become a crucial method in multi-round interactive services. However, in infilling tasks, KV cache reuse is often hindered by the structure of the prompt format, which typically consists of a prefix and a suffix relative to the insertion point. Specifically, the KV cache of the prefix or suffix part is frequently invalidated as the other part (suffix or prefix) is incrementally generated. To address this issue, we propose EFIM, a transformed prompt format of FIM to unleash the performance potential of KV cache…
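The invalidation problem described in the abstract can be illustrated with a small sketch. It assumes a prefix-matching cache, where a new request can reuse cached KV entries only for the longest run of leading tokens it shares with an earlier request. The sentinel tokens `<PRE>`, `<SUF>`, `<MID>`, the whitespace "tokenizer", and the helper names `build_fim_prompt` and `reusable_tokens` are hypothetical and chosen for illustration; this shows a conventional prefix-suffix FIM layout, not EFIM's transformed format.

```python
from typing import List

# Hypothetical sentinel tokens; real models define their own special tokens.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def build_fim_prompt(prefix: List[str], suffix: List[str]) -> List[str]:
    """Lay out a prompt as: <PRE> prefix <SUF> suffix <MID>."""
    return [PRE, *prefix, SUF, *suffix, MID]


def reusable_tokens(cached: List[str], new: List[str]) -> int:
    """Count leading tokens shared by both prompts.

    Under the prefix-matching assumption, only these tokens can reuse
    cached KV entries; everything after the first mismatch is recomputed.
    """
    count = 0
    for old_tok, new_tok in zip(cached, new):
        if old_tok != new_tok:
            break
        count += 1
    return count


# Request 1: the user has typed a function header; the suffix is fixed.
prefix_v1 = "def add ( a , b ) :".split()
suffix = "return result".split()

# Request 2: the prefix has grown by three tokens; the suffix is unchanged.
prefix_v2 = prefix_v1 + "result = a".split()

prompt_v1 = build_fim_prompt(prefix_v1, suffix)
prompt_v2 = build_fim_prompt(prefix_v2, suffix)

hits = reusable_tokens(prompt_v1, prompt_v2)
print(f"{hits} of {len(prompt_v2)} tokens reusable")  # 9 of 16 tokens reusable
```

Although the suffix is identical in both requests, it sits after the point where the prefix grew, so its cached entries no longer line up with the new prompt and must be recomputed. Only `<PRE>` and the original eight prefix tokens are reused.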

Code & Models

Repositories

gty111/efim (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Big Data and Digital Economy