Loki: A System for Serving ML Inference Pipelines with Hardware and Accuracy Scaling

Sohaib Ahmad; Hui Guan; Ramesh K. Sitaraman

arXiv:2407.03583·cs.DC·July 8, 2024

TL;DR

Loki combines hardware scaling and accuracy scaling to serve ML inference pipelines, reducing SLO violations by up to 10x compared to the state of the art and increasing effective capacity by over 2.7x.

Contribution

Loki introduces a novel framework and query routing algorithm for integrated hardware and accuracy scaling in inference pipelines.

Findings

01 Effective capacity increased by over 2.7x with accuracy scaling.

02 Loki reduces SLO violations by up to 10x compared to state-of-the-art.

03 Maintains high accuracy with minimal compromises.

Abstract

The rapid adoption of machine learning (ML) has underscored the importance of serving ML models with high throughput and resource efficiency. Traditional approaches to managing increasing query demands have predominantly focused on hardware scaling, which involves increasing server count or computing power. However, this strategy is often impractical because of limits on the available budget or compute resources. As an alternative, accuracy scaling offers a promising solution by adjusting the accuracy of ML models to accommodate fluctuating query demands. Yet existing accuracy scaling techniques target independent ML models and tend to underperform when managing inference pipelines. Furthermore, they lack integration with hardware scaling, leading to potential resource inefficiencies during low-demand periods. To address these limitations, this paper introduces Loki, a system…
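To make the two scaling modes in the abstract concrete, here is a minimal sketch for a single model rather than a pipeline. It is not Loki's framework or routing algorithm; the `Variant` type, the `plan` function, and all numbers are hypothetical and chosen only for illustration. The sketch first tries hardware scaling (adding servers) for the most accurate variant and switches to a less accurate, faster variant only when the server budget is exhausted.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str              # hypothetical model variant label
    accuracy: float        # hypothetical accuracy in [0, 1]
    qps_per_server: float  # hypothetical throughput of one server


# Made-up variants for illustration only.
VARIANTS = [
    Variant("large", 0.80, 50.0),
    Variant("medium", 0.76, 100.0),
    Variant("small", 0.70, 200.0),
]


def plan(demand_qps, max_servers, variants=VARIANTS):
    """Pick (variant, server_count) for a given demand.

    Hardware scaling: the server count grows or shrinks with demand.
    Accuracy scaling: a less accurate variant is used only when the
    most accurate one would need more servers than are available.
    """
    for v in sorted(variants, key=lambda v: v.accuracy, reverse=True):
        servers = max(1, math.ceil(demand_qps / v.qps_per_server))
        if servers <= max_servers:
            return v, servers
    # Demand exceeds what any variant can serve: saturate the cluster
    # with the highest-throughput variant.
    fastest = max(variants, key=lambda v: v.qps_per_server)
    return fastest, max_servers


if __name__ == "__main__":
    for demand in (100, 300, 700, 1000):
        variant, servers = plan(demand, max_servers=4)
        print(demand, variant.name, servers)
```

At low demand the sketch releases servers instead of keeping the cluster fully provisioned, which is the inefficiency the abstract attributes to accuracy scaling without hardware scaling.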

Code & Models

Repositories

UMass-LIDS/Loki (official repository)
