PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models

Eunyeong Cho; Jehyeon Bang; Ranggi Hwang; Minsoo Rhu

arXiv:2602.11530 · cs.LG · February 13, 2026

TL;DR

PASCAL is a phase-aware scheduling algorithm designed to improve the serving efficiency of reasoning-based large language models by reducing time-to-first-token and balancing load during extended reasoning phases.

Contribution

It introduces a hierarchical, phase-aware scheduler that prioritizes the reasoning phase and enables dynamic migration at phase boundaries, addressing the failure of existing LLM serving frameworks to distinguish between reasoning and answering phases.
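The instance-level layer of the hierarchical scheduler can be illustrated with a small sketch. Everything here is a hypothetical assumption, not PASCAL's published policy: the names `Request`, `Instance`, `place`, and `on_phase_boundary`, the use of reasoning-request count and free KV-cache capacity as load signals, and the rule that a request migrates at its reasoning-to-answering boundary only when the destination would still have more free KV memory than the source has before the move.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: int
    kv_tokens: int            # KV-cache footprint in tokens (hypothetical unit)
    phase: str = "reasoning"  # "reasoning" or "answering"

@dataclass
class Instance:
    name: str
    kv_capacity: int
    requests: list = field(default_factory=list)

    def free_kv(self) -> int:
        return self.kv_capacity - sum(r.kv_tokens for r in self.requests)

    def reasoning_load(self) -> int:
        return sum(1 for r in self.requests if r.phase == "reasoning")

def place(instances, req):
    # Hypothetical instance-level placement: send a new (reasoning) request to
    # the instance with the fewest reasoning requests, ties broken by most free KV.
    target = min(instances, key=lambda i: (i.reasoning_load(), -i.free_kv()))
    target.requests.append(req)
    return target

def on_phase_boundary(instances, src, req):
    # Hypothetical hook called when `req` finishes reasoning on `src`.
    req.phase = "answering"
    candidates = [i for i in instances
                  if i is not src and i.free_kv() >= req.kv_tokens]
    if not candidates:
        return src
    best = max(candidates, key=lambda i: i.free_kv())
    # Migrate only if the destination keeps more free KV than the source has now.
    if best.free_kv() - req.kv_tokens > src.free_kv():
        src.requests.remove(req)
        best.requests.append(req)
        return best
    return src
```

For example, with two instances of capacity 100, requests of 30, 20, and 40 tokens land on A, B, and B; when the 20-token request on B reaches its answering phase it moves to A (A would keep 50 free versus B's 40), while a later boundary for the 40-token request leaves it on B.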

Findings

1. Reduces tail TTFT by up to 72% in benchmarks using DeepSeek-R1-Distill-Qwen-32B.

2. Maintains answering-phase SLO attainment.

3. Effectively balances load and reduces interference during reasoning.
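Tail TTFT and SLO attainment are standard serving metrics, but this listing does not state which percentile or SLO definition the paper uses. The sketch below therefore assumes a nearest-rank p99 for "tail" and a simple per-request latency threshold for attainment; the function names are hypothetical. With hypothetical numbers, a p99 TTFT falling from 10 s to 2.8 s would correspond to a 72% reduction.

```python
import math

def percentile(values, p):
    # Nearest-rank percentile: smallest sample with at least p% of samples <= it.
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def relative_reduction(baseline, new):
    # Fractional improvement of `new` over `baseline` (0.72 means 72% lower).
    return (baseline - new) / baseline

def slo_attainment(latencies, slo):
    # Assumed definition: fraction of requests whose latency meets the SLO.
    return sum(1 for x in latencies if x <= slo) / len(latencies)
```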

Abstract

The emergence of reasoning-based LLMs leveraging Chain-of-Thought (CoT) inference introduces new serving challenges, as their extended reasoning phases delay user-visible output and inflate Time-To-First-Token (TTFT). Existing LLM serving frameworks fail to distinguish between reasoning and answering phases, leading to performance degradation under GPU memory constraints. We present PASCAL, a phase-aware scheduling algorithm that prioritizes reasoning to reduce TTFT while using controlled preemption and token pacing during answering to preserve Quality-of-Experience (QoE). Our hierarchical scheduler combines instance-level placement with intra-instance execution and enables dynamic migration at phase boundaries to balance load and reduce interference. Across benchmarks using DeepSeek-R1-Distill-Qwen-32B, PASCAL reduces tail TTFT by up to 72% while maintaining answering phase SLO…
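Within a single instance, the combination of reasoning priority, controlled preemption, and token pacing might look like the greedy admission step below. This is a minimal sketch under details the abstract does not specify: a KV-cache budget in tokens, a fixed target streaming pace `pace` (tokens/s), and a three-tier admission order in which answering requests that have fallen behind pace are protected, reasoning requests come next, and answering requests ahead of pace are the first to be preempted. `Req`, `pacing_slack`, `admission_key`, and `schedule_step` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Req:
    rid: int
    phase: str                 # "reasoning" or "answering"
    kv_tokens: int             # KV-cache footprint (hypothetical unit)
    answer_start: float = 0.0  # time the answering phase began (s)
    emitted: int = 0           # answer tokens already streamed to the user

def pacing_slack(req, now, pace):
    # Answer tokens ahead (+) or behind (-) a steady pace of `pace` tokens/s.
    return req.emitted - (now - req.answer_start) * pace

def admission_key(req, now, pace):
    if req.phase == "reasoning":
        return (1, 0.0)        # tier 1: reasoning, prioritized to cut TTFT
    slack = pacing_slack(req, now, pace)
    if slack < 0:
        return (0, slack)      # tier 0: answering behind pace, protect QoE
    return (2, slack)          # tier 2: answering ahead of pace, preempt first

def schedule_step(reqs, kv_budget, now, pace):
    # Greedily admit requests in tier order until the KV budget is exhausted;
    # anything that does not fit is preempted for this step.
    order = sorted(reqs, key=lambda r: admission_key(r, now, pace))
    batch, preempted, used = [], [], 0
    for r in order:
        if used + r.kv_tokens <= kv_budget:
            batch.append(r)
            used += r.kv_tokens
        else:
            preempted.append(r)
    return batch, preempted
```

In this sketch an answering request is ranked below reasoning work only while it is ahead of pace, so the requests deferred first are those with streamed tokens to spare, while reasoning still receives priority over them to shorten TTFT.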


Taxonomy

Topics: Parallel Computing and Optimization Techniques · Topic Modeling · Software System Performance and Reliability