RelServe: Fast LLM Inference Serving on Relational Data
Xin Zhang, Shihong Gao, Yanyan Shen, Haoyang Li, Lei Chen

TL;DR
RelServe is an LLM inference engine for relQuery workloads, in which templated LLM calls are applied to relational tables. It reduces latency through dynamic prioritization and adaptive batching, giving faster responses in AI-powered applications.
Contribution
RelServe introduces a Dynamic Priority Updater and an Adaptive Batch Arranger to optimize LLM inference serving, addressing Head-of-Line (HoL) blocking and latency trade-offs in relQuery workloads.
Findings
Up to 3.1x latency reduction compared to vLLM
Effective handling of concurrent relQuery workloads
Validated on four real-world datasets with large LLMs
Abstract
The use of Large Language Models (LLMs) for querying relational data has given rise to relQuery, a workload pattern that applies templated LLM calls to structured tables. As relQuery services become more widely adopted in applications such as AI-powered spreadsheets, fast response times under concurrent query loads are increasingly important. Unfortunately, current LLM engines face severe latency bottlenecks from Head-of-Line (HoL) blocking across three comparable inference phases: waiting, core running, and tail running. Existing static priority scheduling methods only address HoL blocking during the waiting phase, leaving two critical problems unsolved. First, the absence of a priority update mechanism causes inaccurate prioritization and continued HoL blocking during core execution. Second, suboptimal prefill-decode batching exacerbates HoL blocking in tail execution and worsens…
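As a rough illustration of why HoL blocking hurts latency, the sketch below compares the average waiting time of requests served strictly in arrival order against the same requests served shortest-first. It is a generic toy queueing example, not RelServe's scheduler: the function name `average_wait` and the service times are hypothetical, and real engines batch requests rather than serving them one at a time.

```python
from typing import List


def average_wait(service_times: List[float]) -> float:
    """Mean time requests spend waiting before they start,
    when served one at a time in the given order."""
    clock = 0.0          # time at which the next request can start
    total_waited = 0.0   # sum of waiting times over all requests
    for t in service_times:
        total_waited += clock
        clock += t
    return total_waited / len(service_times)


# Hypothetical service times (arbitrary units): one long request
# arrives just ahead of three short ones.
arrival_order = [10.0, 1.0, 1.0, 1.0]

fcfs_wait = average_wait(arrival_order)          # long request blocks the rest
sjf_wait = average_wait(sorted(arrival_order))   # short requests go first

print(f"arrival order: {fcfs_wait}, shortest first: {sjf_wait}")
```

In this toy setting the long request at the head of the queue delays every short request behind it, and reordering removes most of that delay. A static ordering like this one is fixed before execution starts, which is the limitation the abstract attributes to existing priority schedulers.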
Taxonomy
Topics: Cloud Computing and Resource Management · Data Quality and Management · Software System Performance and Reliability
