Faster LLM Inference using DBMS-Inspired Preemption and Cache Replacement Policies

Kyoungmin Kim; Jiacheng Li; Kijae Hong; Anastasia Ailamaki

arXiv:2411.07447·cs.PF·October 3, 2025·2 cites

Open Access

TL;DR

This paper introduces database-inspired preemption and cache replacement techniques to significantly improve the speed and efficiency of large language model inference on GPUs, enabling better resource management.

Contribution

It develops a novel cost model and cache replacement policy for LLM inference, adapting database techniques to optimize GPU resource utilization and request scheduling.

Findings

01 Substantial GPU cost savings achieved.

02 Enhanced inference request scheduling efficiency.

03 Improved inference throughput and latency.

Abstract

LLMs are increasingly used worldwide, from daily tasks to agentic systems and data analytics, and they require significant GPU resources. LLM inference systems, however, are slow compared to database systems, and inference performance and mechanisms have often been regarded as a black box, limiting the expansion of LLM use inside databases and other performance-critical applications. This paper first analyzes LLM inference performance and then focuses on a data management issue inside LLM inference. We find that inference systems lack an adequate resource cost model and optimization strategy for scheduling requests whose intermediate results reside in a cache in GPU memory when multiple inference requests execute concurrently. We adapt classic database techniques by building cost models for concurrent inference requests and a new cache replacement policy tailored for LLM inference,…

Taxonomy

Topics: Distributed and Parallel Computing Systems