Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving

Chengyi Nie; Rodrigo Fonseca; Zhenhua Liu

arXiv:2405.06856 · cs.DC · May 14, 2024 · 1 cites

Open Access 3 Reviews

TL;DR

Aladdin is a cluster-level scheduler for LLM inference that co-optimizes query placement and resource scaling to meet SLOs, reducing serving costs by up to 71% for the same SLO.

Contribution

The paper introduces an SLO-aware scheduling approach that jointly manages query placement and resource scaling for LLM serving at the cluster level.

Findings

01. Reduces serving costs by up to 71% for the same SLO.
02. Effectively predicts minimal resources needed for SLO compliance.
03. Improves resource utilization compared to baseline methods.

Abstract

The demand for large language model (LLM) inference is gradually dominating the artificial intelligence workloads. Therefore, there is an urgent need for cost-efficient inference serving. Existing work focuses on single-worker optimization and lacks consideration of cluster-level management for both inference queries and computing resources. However, placing requests and managing resources without considering the query features easily causes SLO violations or resource underutilization. Providers are forced to allocate extra computing resources to guarantee user experience, leading to additional serving costs. In this paper we introduce Aladdin, a scheduler that co-adaptively places queries and scales computing resources with SLO awareness. For a stream of inference queries, Aladdin first predicts minimal computing resources and the corresponding serving workers' configuration required…

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1. This paper addresses a real problem, and its assumptions are realistic.
2. The method jointly optimizes request placement and resource scaling under probabilistic SLO constraints, whereas existing work treats these problems independently.
3. The method adapts to dynamic workloads with provable efficiency.
4. This work bridges theory and LLM systems practice. It has potential for industry deployment after some improvements.

Weaknesses

I personally like the idea of this paper. The only weakness I see is that the evaluation is limited. It would be better if the authors added more experiments, such as varying the workload, to better simulate real-world scenarios.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The paper is well structured and clearly presented, with a logical flow that makes the main arguments easy to follow. The proposed approach appears to be useful in a Service Level Objective (SLO) context, offering practical insights that could benefit both researchers and practitioners. The authors provide sufficient motivation and background, and the organization of the sections enhances readability.

Weaknesses

(1) The paper mentions that the testbed is limited to A100 or V100 GPUs. However, in practice, the total number of GPUs in a cluster is large. Therefore, the small scale of the physical testbed seems unconvincing. Moreover, the paper needs to better justify why a simulation can represent a real-world setting.
(2) Using only a few GPUs in the experiments does not make the "cluster-level" claim convincing.
(3) Heterogeneous GPUs: Did the experiments consider heterogeneous GPU environments? It ap

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. Instead of pursuing a complex and high-overhead predictor, the authors adopt a simple "historical average" estimator, arguing that its "unbiased" nature allows errors to partially cancel out within a batch. This, combined with the proposed re-balancing mechanism, is an interesting and practical design choice.
2. The paper presents an end-to-end system that jointly optimizes worker configuration and request placement. Formulating the placement task as an online multi-dimensional bin packing pr
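
The error-cancellation argument in Reviewer 03's first point can be illustrated with a small simulation. This is a minimal sketch and not the paper's estimator or code: the exponential output-length distribution, its mean of 200 tokens, the batch size of 64, and the function name `simulate` are all assumptions chosen for illustration. It predicts every request's output length as the historical mean and compares the average per-request relative error with the relative error of the predicted batch total.

```python
import random
import statistics


def simulate(batch_size=64, n_batches=200, history_size=10_000, seed=0):
    """Compare per-request and per-batch error of a historical-mean predictor.

    All workload parameters here are hypothetical, not taken from the paper.
    """
    rng = random.Random(seed)

    def draw():
        # Assumed workload: output lengths (tokens), exponential, mean 200.
        return rng.expovariate(1 / 200.0)

    history = [draw() for _ in range(history_size)]
    predicted = statistics.fmean(history)  # the "historical average" estimate

    per_request_errors = []
    per_batch_errors = []
    for _ in range(n_batches):
        batch = [draw() for _ in range(batch_size)]
        # Relative error when the mean is used to predict each request alone.
        per_request_errors.extend(abs(x - predicted) / predicted for x in batch)
        # Relative error when the mean is used to predict the batch total.
        predicted_total = predicted * batch_size
        per_batch_errors.append(abs(sum(batch) - predicted_total) / predicted_total)

    return statistics.fmean(per_request_errors), statistics.fmean(per_batch_errors)


per_request_err, per_batch_err = simulate()
print(f"mean relative error per request: {per_request_err:.3f}")
print(f"mean relative error per batch:   {per_batch_err:.3f}")
```

Under these assumptions the batch-level error comes out much smaller than the per-request error, because over- and under-estimates within a batch offset each other. The effect weakens for small batches or for a predictor with systematic bias.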

Weaknesses

1. The paper's structure could be improved. The first paragraph of the introduction is long; it should be focused on defining the problem. The review of recent methods and research gaps can be the second paragraph. Figure 1 consumes a large amount of vertical space; a horizontal layout would be more space-efficient.
2. Section 2 is overly long. It should be simplified, with non-essential details moved to the appendix. Furthermore, Section 2 and Section 3 are interconnected and could be combined

Taxonomy

Topics: Mobile Agent-Based Network Management · Service-Oriented Architecture and Web Services · Power Systems and Technologies