Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels
Mingcong Song, Xinru Tang, Fengfan Hou, Jing Li, Wei Wei, Yipeng Ma, Runqiu Xiao, Hongjie Si, Dingcheng Jiang, Shouyi Yin, Yang Hu, Guoping Long

TL;DR
XY-Serve is a production LLM serving system that employs hybrid scheduling and meta-kernels to handle workload variability, achieving significant throughput improvements on Ascend NPUs and outperforming existing kernels.
Contribution
The paper introduces XY-Serve, a novel end-to-end LLM serving system with a workload smoothing abstraction and meta-kernels for attention and GEMM, optimized for tile-based architectures.
Findings
Up to 89% throughput improvement on Ascend NPUs.
GEMM kernels are on average 14.6% faster.
Attention kernels are on average 21.5% faster.
Abstract
Meeting growing demands for low latency and cost efficiency in production-grade large language model (LLM) serving systems requires integrating advanced optimization techniques. However, dynamic and unpredictable input-output lengths of LLM, compounded by these optimizations, exacerbate the issues of workload variability, making it difficult to maintain high efficiency on AI accelerators, especially DSAs with tile-based programming models. To address this challenge, we introduce XY-Serve, a versatile, Ascend native, end-to-end production LLM-serving system. The core idea is an abstraction mechanism that smooths out the workload variability by decomposing computations into unified, hardware-friendly, fine-grained meta primitives. For attention, we propose a meta-kernel that computes the basic pattern of matmul-softmax-matmul with architectural-aware tile sizes. For GEMM, we introduce a…
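The matmul-softmax-matmul pattern named in the abstract can be illustrated with a small tile-by-tile attention routine. What follows is a minimal sketch, not the paper's meta-kernel: it is plain Python rather than an Ascend kernel, it uses the well-known online-softmax (FlashAttention-style) rescaling as a stand-in for whatever the authors do internally, and the names `tiled_attention`, `reference_attention`, and the `tile` parameter are hypothetical. The point is only that keys and values of any length can be consumed in fixed-size tiles while still producing the same result as untiled attention.

```python
import math

def reference_attention(Q, K, V):
    """Untiled softmax(Q K^T / sqrt(d)) V, used only as a correctness check."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        s = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        mx = max(s)
        p = [math.exp(x - mx) for x in s]
        z = sum(p)
        out.append([sum(pj * v[c] for pj, v in zip(p, V)) / z
                    for c in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, tile=2):
    """Same result, but K/V are consumed in fixed-size tiles (assumed size `tile`)."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        m = float("-inf")        # running maximum of the scores seen so far
        l = 0.0                  # running softmax denominator
        acc = [0.0] * len(V[0])  # running unnormalized output row
        for start in range(0, len(K), tile):
            k_tile = K[start:start + tile]
            v_tile = V[start:start + tile]
            # First matmul: scores of this query against one key tile.
            s = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
            m_new = max(m, max(s))
            corr = math.exp(m - m_new)  # rescales earlier partial sums; 0.0 on the first tile
            # Softmax numerator for this tile, relative to the new running max.
            p = [math.exp(x - m_new) for x in s]
            l = l * corr + sum(p)
            # Second matmul: weight the value tile and fold it into the accumulator.
            acc = [a * corr + sum(pj * v[c] for pj, v in zip(p, v_tile))
                   for c, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])
    return out
```

Because the last tile is simply shorter when the key count is not a multiple of `tile`, the same loop body handles every sequence length, which is the sense in which a fixed tile shape can absorb variable workloads in this sketch.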
Taxonomy
Topics: Advanced Manufacturing and Logistics Optimization · Scheduling and Optimization Algorithms · Optimization and Search Problems
Methods: Softmax · Attention Is All You Need
