Faster LLM Inference using DBMS-Inspired Preemption and Cache Replacement Policies
Kyoungmin Kim, Jiacheng Li, Kijae Hong, Anastasia Ailamaki

TL;DR
This paper introduces database-inspired preemption and cache replacement techniques for large language model inference on GPUs, aiming to improve speed and efficiency through better management of GPU memory across concurrent requests.
Contribution
It develops a novel cost model and cache replacement policy for LLM inference, adapting database techniques to optimize GPU resource utilization and request scheduling.
Findings
Substantial GPU cost savings achieved.
Enhanced inference request scheduling efficiency.
Improved inference throughput and latency.
Abstract
LLMs are increasingly used worldwide, from daily tasks to agentic systems and data analytics, and they require significant GPU resources. LLM inference systems, however, are slow compared to database systems, and their performance and mechanisms have often been regarded as a black box, which limits the wider use of LLMs inside databases and other performance-critical applications. This paper first analyzes LLM inference performance and then focuses on a data management issue inside LLM inference. We find that inference systems lack an adequate resource cost model and optimization strategy for scheduling requests whose intermediate results reside in a cache in GPU memory when multiple inference requests execute concurrently. We adapt classic database techniques by building cost models for concurrent inference requests and a new cache replacement policy tailored for LLM inference,…
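To make the resource question in the abstract concrete, here is a minimal sketch of a memory cost estimate for the per-request intermediate results (the key-value cache) and a greedy admission check against a GPU memory budget. It is not the paper's cost model or policy. The names `ModelShape`, `kv_bytes_per_token`, and `admit`, the model dimensions, and the 1 GiB budget are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical transformer dimensions (assumption, not from the paper)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    dtype_bytes: int = 2  # 2 bytes per element, e.g. fp16 or bf16


def kv_bytes_per_token(m: ModelShape) -> int:
    # Each token stores one key vector and one value vector
    # per layer and per KV head, hence the factor of 2.
    return 2 * m.num_layers * m.num_kv_heads * m.head_dim * m.dtype_bytes


def admit(request_tokens: list[int], m: ModelShape, budget_bytes: int):
    """Greedily admit requests in arrival order while their cache fits.

    Returns (admitted_indices, waiting_indices). A real scheduler would
    also decide which running requests to preempt or evict; this sketch
    only shows the memory accounting.
    """
    per_token = kv_bytes_per_token(m)
    admitted, waiting, used = [], [], 0
    for i, tokens in enumerate(request_tokens):
        need = tokens * per_token
        if used + need <= budget_bytes:
            admitted.append(i)
            used += need
        else:
            waiting.append(i)
    return admitted, waiting


shape = ModelShape(num_layers=32, num_kv_heads=32, head_dim=128)
admitted, waiting = admit([1000, 1500, 500], shape, budget_bytes=1 << 30)
```

With these assumed dimensions each token costs 512 KiB of cache, so a 1 GiB budget holds 2048 tokens: the first and third requests fit together, while the second has to wait. Which request to delay, preempt, or evict in that situation is the kind of decision a cost model and replacement policy are meant to inform.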
Taxonomy
Topics: Distributed and Parallel Computing Systems
