TL;DR
Loki is a system that combines hardware scaling and accuracy scaling to serve ML inference pipelines, significantly increasing effective capacity and reducing SLO violations compared to existing methods.
Contribution
Loki introduces a novel framework and query routing algorithm for integrated hardware and accuracy scaling in inference pipelines.
Findings
Effective capacity increases by more than 2.7x with accuracy scaling.
Loki reduces service-level objective (SLO) violations by up to 10x compared to state-of-the-art systems.
Loki maintains high accuracy with minimal compromises.
Abstract
The rapid adoption of machine learning (ML) has underscored the importance of serving ML models with high throughput and resource efficiency. Traditional approaches to managing increasing query demands have predominantly focused on hardware scaling, which involves increasing server count or computing power. However, this strategy can often be impractical due to limitations in the available budget or compute resources. As an alternative, accuracy scaling offers a promising solution by adjusting the accuracy of ML models to accommodate fluctuating query demands. Yet, existing accuracy scaling techniques target independent ML models and tend to underperform while managing inference pipelines. Furthermore, they lack integration with hardware scaling, leading to potential resource inefficiencies during low-demand periods. To address the limitations, this paper introduces Loki, a system…
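The interplay between hardware scaling and accuracy scaling described in the abstract can be illustrated with a small sketch. This is not Loki's algorithm: it covers a single model rather than a pipeline, and the `Variant` type, the `plan` function, and all numbers are hypothetical, chosen only for illustration. The idea is to add servers first so the most accurate variant can be kept, and to drop to a less accurate but faster variant only when the server budget is exhausted.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A hypothetical model variant with a fixed accuracy and per-server throughput."""
    name: str
    accuracy: float
    qps_per_server: float


def plan(demand_qps, max_servers, variants):
    """Return (variant, servers) for a given demand and server budget.

    Hardware scaling: use as many servers as the demand requires.
    Accuracy scaling: if the budget is too small for a variant,
    fall back to the next most accurate one.
    """
    for v in sorted(variants, key=lambda v: v.accuracy, reverse=True):
        servers = max(1, math.ceil(demand_qps / v.qps_per_server))
        if servers <= max_servers:
            return v, servers
    # No variant can meet the demand: serve with the fastest one on every server.
    fastest = max(variants, key=lambda v: v.qps_per_server)
    return fastest, max_servers


# Illustrative numbers only, not taken from the paper.
VARIANTS = [
    Variant("large", accuracy=0.90, qps_per_server=100.0),
    Variant("small", accuracy=0.80, qps_per_server=300.0),
]

if __name__ == "__main__":
    for demand in (150, 900, 5000):
        variant, servers = plan(demand, max_servers=4, variants=VARIANTS)
        print(demand, variant.name, servers)
```

In this toy setting, low demand is served by the accurate variant on few servers, which avoids wasting resources in quiet periods, while high demand triggers a switch to the faster variant once the four-server budget cannot sustain the accurate one.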
