IPA: Inference Pipeline Adaptation to Achieve High Accuracy and Cost-Efficiency
Saeid Ghafouri, Kamran Razavi, Mehran Salmani, Alireza Sanaee, Tania Lorido-Botran, Lin Wang, Joseph Doyle, Pooyan Jamshidi

TL;DR
IPA is an online system that dynamically adapts inference pipelines by selecting model variants and configuring parameters to optimize accuracy and cost while meeting latency SLAs, demonstrated to improve accuracy significantly with minimal cost increase.
Contribution
The paper introduces IPA, a novel online adaptation system that efficiently manages model variants and configuration parameters to optimize multi-objective inference trade-offs in real-time.
Findings
Improves end-to-end accuracy by up to 21%.
Achieves better cost-accuracy trade-offs than existing methods.
Demonstrates effectiveness in real-world Kubernetes pipelines.
Abstract
Efficiently optimizing multi-model inference pipelines for fast, accurate, and cost-effective inference is a crucial challenge in machine learning production systems, given their tight end-to-end latency requirements. To simplify the exploration of the vast and intricate trade-off space of latency, accuracy, and cost in inference pipelines, providers frequently opt to consider one of them. However, the challenge lies in reconciling latency, accuracy, and cost trade-offs. To address this challenge and propose a solution to efficiently manage model variants in inference pipelines, we present IPA, an online deep learning Inference Pipeline Adaptation system that efficiently leverages model variants for each deep learning task. Model variants are different versions of pre-trained models for the same deep learning task with variations in resource requirements, latency, and accuracy. IPA…
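The trade-off described in the abstract can be illustrated with a small sketch. This is not IPA's actual optimizer, which is not shown in this excerpt; it is a brute-force search under stated assumptions. The `Variant` fields, the `choose_variants` function, the `cost_weight` parameter, the use of a product of per-stage accuracies as the end-to-end accuracy, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Variant:
    """One pre-trained model variant for a pipeline stage (hypothetical fields)."""
    name: str
    latency_ms: float  # per-request latency of this variant
    accuracy: float    # stand-alone accuracy in [0, 1]
    cost: float        # resource cost, e.g. CPU cores


def choose_variants(stages, sla_ms, cost_weight=0.0):
    """Pick one variant per stage, maximizing accuracy - cost_weight * cost
    subject to the summed latency staying within the SLA.

    Assumption: end-to-end accuracy is the product of stage accuracies.
    Returns a tuple of variants, or None if no combination meets the SLA.
    """
    best, best_score = None, float("-inf")
    for combo in product(*stages):  # every combination of one variant per stage
        latency = sum(v.latency_ms for v in combo)
        if latency > sla_ms:
            continue  # violates the end-to-end latency SLA
        accuracy = 1.0
        for v in combo:
            accuracy *= v.accuracy
        cost = sum(v.cost for v in combo)
        score = accuracy - cost_weight * cost
        if score > best_score:
            best, best_score = combo, score
    return best


# Made-up two-stage pipeline: a detector followed by a classifier.
stages = [
    [Variant("det-small", 20, 0.70, 1), Variant("det-large", 60, 0.90, 4)],
    [Variant("cls-small", 10, 0.75, 1), Variant("cls-large", 50, 0.92, 4)],
]

accuracy_first = choose_variants(stages, sla_ms=100)
cost_aware = choose_variants(stages, sla_ms=100, cost_weight=0.1)
print([v.name for v in accuracy_first])
print([v.name for v in cost_aware])
```

With these made-up numbers, the two largest variants together exceed the 100 ms SLA, so the accuracy-first choice mixes a large detector with a small classifier, while a nonzero `cost_weight` pushes the choice toward the two small variants. Exhaustive search grows exponentially with the number of stages, so a sketch like this only suits tiny pipelines.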
Taxonomy
Topics: Machine Learning in Materials Science · Advanced Neural Network Applications · Radiation Effects in Electronics
Methods: travel james · OPT
