A House United Within Itself: SLO-Awareness for On-Premises Containerized ML Inference Clusters via Faro
Beomyeol Jeon, Chen Wang, Diana Arroyo, Alaa Youssef, Indranil Gupta

TL;DR
This paper introduces Faro, a system for managing on-premises ML inference clusters that dynamically allocates resources across jobs to meet latency SLOs. It relaxes ("sloppifies") its utility functions and predictors, trading some precision for faster optimization.
Contribution
Faro's novel "sloppify" approach relaxes utility functions and predictors, enabling fast, adaptive resource allocation in ML inference clusters.
Findings
Faro reduces SLO violations by up to 23 times.
Faro adapts quickly to workload changes.
Faro outperforms existing systems in cluster management.
Abstract
This paper tackles the challenge of running multiple ML inference jobs (models) under time-varying workloads on a constrained on-premises production cluster. Our system Faro takes in latency Service Level Objectives (SLOs) for each job, auto-distills them into utility functions, "sloppifies" these utility functions to make them amenable to mathematical optimization, automatically predicts workload via probabilistic prediction, and dynamically makes implicit cross-job resource allocations in order to satisfy cluster-wide objectives, e.g., total utility, fairness, and other hybrid variants. A major challenge Faro tackles is that using precise utilities and high-fidelity predictors can be too slow (and in a sense too precise!) for the fast adaptation we require. Faro's solution is to "sloppify" (relax) its multiple design components to achieve fast adaptation without overly degrading…
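To make the idea of relaxing a utility function concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Faro's actual formulation: the step utility, the power-law relaxation, the `alpha` parameter, and all function names are hypothetical choices made for this example. The point is only that a hard met-or-missed SLO utility is discontinuous, while a relaxed version degrades smoothly and is therefore easier for a numerical optimizer to work with.

```python
def step_utility(latency_ms: float, slo_ms: float) -> float:
    """Precise utility (assumed form): full credit only when the SLO is met."""
    return 1.0 if latency_ms <= slo_ms else 0.0


def relaxed_utility(latency_ms: float, slo_ms: float, alpha: float = 2.0) -> float:
    """Hypothetical relaxation: utility decays smoothly once latency exceeds the SLO.

    alpha controls how sharply utility drops; it is an assumption of this
    sketch, not a parameter taken from the paper.
    """
    if latency_ms <= slo_ms:
        return 1.0
    return (slo_ms / latency_ms) ** alpha


def total_utility(latencies_ms, slos_ms, utility=relaxed_utility) -> float:
    """Cluster-wide objective (one of several possible): sum of per-job utilities."""
    return sum(utility(lat, slo) for lat, slo in zip(latencies_ms, slos_ms))


if __name__ == "__main__":
    latencies = [100.0, 400.0]  # observed or predicted latency per job (ms)
    slos = [200.0, 200.0]       # latency SLO per job (ms)
    print(total_utility(latencies, slos, utility=step_utility))
    print(total_utility(latencies, slos, utility=relaxed_utility))
```

Under the step utility, a job that misses its SLO by a small margin and one that misses it by a wide margin both score zero, so the optimizer gets no signal about which allocation is closer to meeting the SLO. The relaxed form keeps that signal, at the cost of no longer matching the SLO definition exactly.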
