FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving

Chia-chi Hsieh; Zan Zong; Xinyang Chen; Jianjiang Li; Jidong Zhai; Lijie Wen

arXiv:2602.16603 · cs.DC · February 19, 2026

Open Access

TL;DR

FlowPrefill is an LLM serving system that decouples preemption granularity from scheduling frequency, reducing head-of-line blocking during prefill and raising maximum goodput by up to 5.6× while improving SLO compliance.

Contribution

FlowPrefill introduces operator-level preemption and event-driven scheduling to improve prefill responsiveness and efficiency in LLM serving systems.

Findings

01. Maximum goodput increased by up to 5.6×.

02. Effectively mitigates head-of-line blocking.

03. Satisfies heterogeneous SLOs.

Abstract

The growing demand for large language models (LLMs) requires serving systems to handle many concurrent requests with diverse service level objectives (SLOs). This exacerbates head-of-line (HoL) blocking during the compute-intensive prefill phase, where long-running requests monopolize resources and delay higher-priority ones, leading to widespread time-to-first-token (TTFT) SLO violations. While chunked prefill enables interruptibility, it introduces an inherent trade-off between responsiveness and throughput: reducing chunk size improves response latency but degrades computational efficiency, whereas increasing chunk size maximizes throughput but exacerbates blocking. This necessitates an adaptive preemption mechanism. However, dynamically balancing execution granularity against scheduling overheads remains a key challenge. In this paper, we propose FlowPrefill, a…
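The chunk-size trade-off described in the abstract can be illustrated with a toy cost model. This is only a sketch under stated assumptions, not the paper's model: the function name `chunked_prefill_cost`, the linear per-token cost, and the fixed per-chunk overhead are all hypothetical, and real prefill cost also depends on attention over earlier chunks, batching, and hardware.

```python
import math


def chunked_prefill_cost(prompt_tokens, chunk_size,
                         per_token_ms=0.05, per_chunk_overhead_ms=2.0):
    """Toy model of chunked prefill (all constants are assumed values).

    Returns (total_ms, worst_case_blocking_ms):
      total_ms               time to prefill the whole prompt
      worst_case_blocking_ms longest wait for a newly arrived request,
                             assuming preemption only at chunk boundaries
    """
    num_chunks = math.ceil(prompt_tokens / chunk_size)
    # Each chunk pays a fixed scheduling/launch overhead in this model.
    total_ms = prompt_tokens * per_token_ms + num_chunks * per_chunk_overhead_ms
    # A new request may arrive just as a full chunk starts executing.
    largest_chunk = min(chunk_size, prompt_tokens)
    worst_case_blocking_ms = largest_chunk * per_token_ms + per_chunk_overhead_ms
    return total_ms, worst_case_blocking_ms


if __name__ == "__main__":
    for chunk_size in (256, 1024, 4096, 16384):
        total, blocking = chunked_prefill_cost(16384, chunk_size)
        print(f"chunk={chunk_size:>6}  total={total:8.1f} ms  "
              f"worst-case blocking={blocking:8.1f} ms")
```

Under these assumptions, smaller chunks shorten the worst-case wait for a newly arrived request but add per-chunk overhead to the total prefill time, while larger chunks do the opposite. Both quantities are tied to the single `chunk_size` knob, which is the coupling the abstract identifies as the problem.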

Taxonomy

Topics: Software System Performance and Reliability · Distributed systems and fault tolerance · Cloud Computing and Resource Management