Dynamic Network Adaptation at Inference
Daniel Mendoza, Caroline Trippel

TL;DR
This paper introduces SLO-Aware Neural Networks that dynamically adjust computation during inference by dropping nodes, enabling real-time systems to meet diverse latency and accuracy requirements efficiently.
Contribution
It proposes a novel method for per-inference dynamic node dropout based on SLO targets, improving speed and maintaining accuracy in inference-serving systems.
Findings
Achieves 1.3-56.7× speedups with minimal accuracy loss
Serves multiple accuracy targets with a single trained model
Mitigates latency degradation from co-location interference
Abstract
Machine learning (ML) inference is a real-time workload that must comply with strict Service Level Objectives (SLOs), including latency and accuracy targets. Unfortunately, ensuring that SLOs are not violated in inference-serving systems is challenging due to inherent model accuracy-latency tradeoffs, SLO diversity across and within application domains, evolution of SLOs over time, unpredictable query patterns, and co-location interference. In this paper, we observe that neural networks exhibit high degrees of per-input activation sparsity during inference. Thus, we propose SLO-Aware Neural Networks, which dynamically drop out nodes per inference query, thereby tuning the amount of computation performed, according to specified SLO optimization targets and machine utilization. SLO-Aware Neural Networks achieve average speedups of 1.3-56.7× with little to no accuracy loss (less…
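The idea of tuning computation per query can be illustrated with a minimal sketch. This is not the authors' implementation: the node-selection rule (keep the top-k most active hidden nodes for the current input) and the policy `keep_fraction` that maps a latency budget and machine utilization to a fraction of nodes are both assumptions made for illustration, as are all function and parameter names.

```python
# Illustrative sketch only: a two-layer network whose output layer uses just
# the k most active hidden nodes for the current input. The selection rule and
# the budget-to-fraction policy below are assumptions, not the paper's method.

def matvec(W, x):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def keep_fraction(latency_budget_ms, full_latency_ms, utilization):
    """Hypothetical policy: keep the share of nodes that fits the budget,
    after inflating the full-model latency estimate by machine load."""
    estimated = full_latency_ms * (1.0 + utilization)
    return max(0.1, min(1.0, latency_budget_ms / estimated))

def forward(x, W1, W2, frac):
    """Run one query, dropping all but the top `frac` share of hidden nodes."""
    h = relu(matvec(W1, x))
    k = max(1, round(frac * len(h)))
    # Rank hidden nodes by activation for this input and keep the top k.
    kept = sorted(range(len(h)), key=lambda i: h[i], reverse=True)[:k]
    # The output layer touches only the kept nodes, so its cost scales with k.
    out = [sum(row[i] * h[i] for i in kept) for row in W2]
    return out, kept

# Toy weights: 2 inputs, 4 hidden nodes, 2 outputs.
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
W2 = [[1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 1.0, -1.0]]
x = [1.0, 2.0]

full_out, full_kept = forward(x, W1, W2, 1.0)
frac = keep_fraction(latency_budget_ms=5.0, full_latency_ms=10.0, utilization=0.0)
fast_out, fast_kept = forward(x, W1, W2, frac)
```

In this toy setup a tighter latency budget or a busier machine lowers `frac`, so fewer nodes contribute to the output. A single set of weights serves every budget, which mirrors the claim that one trained model can serve multiple targets.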
Taxonomy
Topics: Advanced Neural Network Applications · Age of Information Optimization · Brain Tumor Detection and Classification
