PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models
Eunyeong Cho, Jehyeon Bang, Ranggi Hwang, Minsoo Rhu

TL;DR
PASCAL is a phase-aware scheduling algorithm for serving reasoning-based large language models. It prioritizes the reasoning phase to reduce Time-To-First-Token (TTFT) and uses dynamic migration at phase boundaries to balance load across instances.
Contribution
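PASCAL introduces a hierarchical, phase-aware scheduler that combines instance-level placement with intra-instance execution, prioritizes the reasoning phase, and enables dynamic migration at phase boundaries. It targets a gap in existing LLM serving frameworks, which do not distinguish between reasoning and answering phases and degrade under GPU memory constraints.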
Findings
Reduces tail TTFT by up to 72% across benchmarks using DeepSeek-R1-Distill-Qwen-32B.
Maintains answering-phase SLO attainment.
Balances load and reduces interference through dynamic migration at phase boundaries.
Abstract
The emergence of reasoning-based LLMs leveraging Chain-of-Thought (CoT) inference introduces new serving challenges, as their extended reasoning phases delay user-visible output and inflate Time-To-First-Token (TTFT). Existing LLM serving frameworks fail to distinguish between reasoning and answering phases, leading to performance degradation under GPU memory constraints. We present PASCAL, a phase-aware scheduling algorithm that prioritizes reasoning to reduce TTFT while using controlled preemption and token pacing during answering to preserve Quality-of-Experience (QoE). Our hierarchical scheduler combines instance-level placement with intra-instance execution and enables dynamic migration at phase boundaries to balance load and reduce interference. Across benchmarks using DeepSeek-R1-Distill-Qwen-32B, PASCAL reduces tail TTFT by up to 72% while maintaining answering phase SLO…
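The phase-aware prioritization described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the names `Phase`, `Request`, `PhaseAwareQueue`, and `pick_instance` are hypothetical, the least-loaded placement rule is an assumption, and preemption and token pacing are left out. It shows only the core idea that requests still in the reasoning phase are scheduled ahead of requests already in the answering phase.

```python
import heapq
from dataclasses import dataclass, field
from enum import IntEnum


class Phase(IntEnum):
    # Lower value means higher priority in this sketch (an assumption).
    REASONING = 0
    ANSWERING = 1


@dataclass(order=True)
class Request:
    # Ordering compares (phase, arrival); req_id is excluded from comparison.
    phase: Phase
    arrival: float
    req_id: str = field(compare=False)


class PhaseAwareQueue:
    """Hypothetical intra-instance queue: reasoning first, then FIFO by arrival."""

    def __init__(self):
        self._heap = []

    def submit(self, req):
        heapq.heappush(self._heap, req)

    def next_batch(self, max_batch):
        # Pop up to max_batch requests in priority order.
        batch = []
        while self._heap and len(batch) < max_batch:
            batch.append(heapq.heappop(self._heap))
        return batch


def pick_instance(loads):
    """Hypothetical instance-level rule: choose the least-loaded instance."""
    return min(loads, key=loads.get)


queue = PhaseAwareQueue()
queue.submit(Request(Phase.ANSWERING, 0.0, "a"))
queue.submit(Request(Phase.REASONING, 2.0, "b"))
queue.submit(Request(Phase.REASONING, 1.0, "c"))

first = [r.req_id for r in queue.next_batch(2)]   # reasoning requests, oldest first
second = [r.req_id for r in queue.next_batch(2)]  # the remaining answering request
target = pick_instance({"gpu0": 5, "gpu1": 2})    # e.g. a migration target at a phase boundary
```

In this sketch the answering request waits even though it arrived first, which is the trade-off the abstract says PASCAL manages with controlled preemption and token pacing during answering.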
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Topic Modeling · Software System Performance and Reliability
