Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels

Mingcong Song; Xinru Tang; Fengfan Hou; Jing Li; Wei Wei; Yipeng Ma; Runqiu Xiao; Hongjie Si; Dingcheng Jiang; Shouyi Yin; Yang Hu; Guoping Long

arXiv:2412.18106·cs.AI·December 30, 2024

TL;DR

XY-Serve is a production LLM serving system that employs hybrid scheduling and meta-kernels to handle workload variability, achieving significant throughput improvements on Ascend NPUs and outperforming existing kernels.

Contribution

The paper introduces XY-Serve, a novel end-to-end LLM serving system with a workload smoothing abstraction and meta-kernels for attention and GEMM, optimized for tile-based architectures.

Findings

01. Up to 89% throughput improvement on Ascend NPUs.

02. GEMM kernels are on average 14.6% faster than existing kernels.

03. Attention kernels are on average 21.5% faster than existing kernels.

Abstract

Meeting growing demands for low latency and cost efficiency in production-grade large language model (LLM) serving systems requires integrating advanced optimization techniques. However, the dynamic and unpredictable input-output lengths of LLM requests, compounded by these optimizations, exacerbate workload variability, making it difficult to maintain high efficiency on AI accelerators, especially DSAs with tile-based programming models. To address this challenge, we introduce XY-Serve, a versatile, Ascend-native, end-to-end production LLM-serving system. The core idea is an abstraction mechanism that smooths out the workload variability by decomposing computations into unified, hardware-friendly, fine-grained meta primitives. For attention, we propose a meta-kernel that computes the basic pattern of matmul-softmax-matmul with architecture-aware tile sizes. For GEMM, we introduce a…
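The matmul-softmax-matmul pattern named in the abstract can be illustrated with a small, hedged sketch. This is not XY-Serve's meta-kernel and says nothing about Ascend tile shapes; it only shows, in plain Python, how one attention head can be computed over fixed-size key/value tiles regardless of sequence length. The running-maximum rescaling (often called online softmax) is an assumption swapped in here to make tiling exact, and the function names and the `tile` parameter are hypothetical.

```python
import math


def attention_reference(q, k, v):
    """Untiled softmax(q k^T / sqrt(d)) v for one head; q, k, v are lists of rows."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out


def attention_tiled(q, k, v, tile=4):
    """Same result, but keys/values are consumed in fixed-size tiles.

    `tile` is a hypothetical parameter standing in for a hardware-chosen tile size.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    dv = len(v[0])
    out = []
    for qi in q:
        running_max = -math.inf   # largest score seen so far
        running_sum = 0.0         # softmax denominator relative to running_max
        acc = [0.0] * dv          # unnormalized weighted sum of value rows
        for start in range(0, len(k), tile):
            k_tile = k[start:start + tile]
            v_tile = v[start:start + tile]
            # First matmul: one query row against one key tile.
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_tile]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new maximum.
            correction = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(weights)
            # Second matmul: softmax weights against the value tile.
            acc = [a * correction + sum(w * vj[c] for w, vj in zip(weights, v_tile))
                   for c, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out
```

Because the last tile may be shorter than `tile`, the same loop body handles any key length, which is the sense in which a single fixed-shape primitive can absorb variable request lengths in this sketch.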

Taxonomy

Topics: Advanced Manufacturing and Logistics Optimization · Scheduling and Optimization Algorithms · Optimization and Search Problems

Methods: Softmax · Attention Is All You Need