On the Impact of White-box Deployment Strategies for Edge AI on Latency and Model Performance
Jaskirat Singh, Bram Adams, Ahmed E. Hassan

TL;DR
This study empirically compares white-box and black-box deployment strategies for Edge AI, analyzing their accuracy-latency trade-offs across different models and tiers to guide MLOps engineers.
Contribution
The study provides a comprehensive empirical assessment of individual deployment operators and their combinations, identifying effective strategies for trading off latency and accuracy in Edge AI.
Findings
Distillation combined with SPTQ (DSPTQ) offers lower latency with small accuracy loss.
Distilled operators outperform partitioning in resource-constrained tiers.
Cloud deployment is preferable for models with small input sizes, while Edge deployment is better for models with large input sizes.
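The summary does not state the mechanism behind the input-size finding. One plausible explanation, offered here as an assumption rather than as the paper's analysis, is that offloading to a remote tier adds a transfer cost that grows with the input size. A faster but more distant Cloud tier would then win only when the payload is small. The sketch below models end-to-end latency as round-trip time plus input upload time plus inference time. The `Tier` fields, `end_to_end_ms`, `fastest_tier`, and all numbers are hypothetical and illustrative, not measurements from the study.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Tier:
    name: str
    compute_ms: float   # inference time of the model on this tier (hypothetical)
    rtt_ms: float       # network round-trip time to reach the tier (hypothetical)
    uplink_mbps: float  # bandwidth for uploading the input to the tier (hypothetical)


def end_to_end_ms(tier: Tier, input_bytes: int) -> float:
    """Round trip + input upload + inference; ignores output transfer and queueing."""
    # 1 Mbps = 1000 bits per millisecond.
    transfer_ms = input_bytes * 8 / (tier.uplink_mbps * 1000)
    return tier.rtt_ms + transfer_ms + tier.compute_ms


def fastest_tier(tiers: Sequence[Tier], input_bytes: int) -> Tier:
    """Pick the tier with the lowest modeled end-to-end latency."""
    return min(tiers, key=lambda t: end_to_end_ms(t, input_bytes))


# Illustrative numbers only: Edge is close but slower, Cloud is far but faster.
edge = Tier("edge", compute_ms=30.0, rtt_ms=2.0, uplink_mbps=200.0)
cloud = Tier("cloud", compute_ms=5.0, rtt_ms=20.0, uplink_mbps=20.0)

small = fastest_tier([edge, cloud], input_bytes=1_000)    # e.g., a short text input
large = fastest_tier([edge, cloud], input_bytes=600_000)  # e.g., a raw image
```

With these assumed numbers, the Cloud tier wins for a 1 KB input (25.4 ms vs 32.04 ms), and the Edge tier wins for a 600 KB input (56 ms vs 265 ms). The crossover point moves with the assumed bandwidths and compute times.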
Abstract
To help MLOps engineers decide which operator to use in which deployment scenario, this study empirically assesses the accuracy vs. latency trade-off of white-box (training-based) and black-box (non-training-based) operators, and of their combinations, in an Edge AI setup. We perform inference experiments with 3 white-box operators (i.e., QAT, Pruning, Knowledge Distillation), 2 black-box operators (i.e., Partition, SPTQ), and their combinations (i.e., Distilled SPTQ, SPTQ Partition) across 3 tiers (i.e., Mobile, Edge, Cloud) on 4 commonly used Computer Vision and Natural Language Processing models, in order to identify effective strategies from the perspective of MLOps engineers. Our results indicate that the combination of the Distillation and SPTQ operators (i.e., DSPTQ) should be preferred over non-hybrid operators when lower latency is required at the Edge and a small to medium accuracy drop is acceptable.…
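SPTQ is listed among the black-box (non-training-based) operators. The sketch below shows why such an operator needs no retraining. It assumes that SPTQ denotes static post-training quantization with a per-tensor affine uint8 scheme and min/max calibration. Under that assumption, activation ranges are measured once, offline, on calibration data and then frozen, whereas dynamic quantization would compute them at runtime. The function names are hypothetical. Production toolchains additionally use per-channel scales, operator fusion, and other calibration methods, and this sketch does not reproduce any specific library.

```python
from typing import List, Sequence, Tuple

QMIN, QMAX = 0, 255  # unsigned 8-bit target range (an assumed choice)


def calibrate(batches: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Static step: record the value range once, offline, from calibration data."""
    lo = min(min(b) for b in batches)
    hi = max(max(b) for b in batches)
    # Widen the range to include 0.0 so that zero maps to an exact integer.
    return min(lo, 0.0), max(hi, 0.0)


def quant_params(lo: float, hi: float) -> Tuple[float, int]:
    """Affine (asymmetric) scale and zero point for the calibrated range."""
    scale = (hi - lo) / (QMAX - QMIN) if hi > lo else 1.0
    zero_point = int(round(QMIN - lo / scale))
    return scale, max(QMIN, min(QMAX, zero_point))


def quantize(xs: Sequence[float], scale: float, zp: int) -> List[int]:
    # Values outside the calibrated range are clamped (saturated).
    return [max(QMIN, min(QMAX, int(round(x / scale)) + zp)) for x in xs]


def dequantize(qs: Sequence[int], scale: float, zp: int) -> List[float]:
    return [(q - zp) * scale for q in qs]


# Usage: calibrate once, then reuse the frozen parameters at inference time.
calibration = [[0.0, 1.0, 0.25], [-1.0, 0.5]]
scale, zp = quant_params(*calibrate(calibration))
x = [-0.8, 0.0, 0.3, 0.99]
q = quantize(x, scale, zp)
x_hat = dequantize(q, scale, zp)
```

Values inside the calibrated range round-trip to within one quantization step (`scale`), and 0.0 is reconstructed exactly. Inputs outside the calibration range are clamped, which is one source of the accuracy drop that quantization can introduce.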
Taxonomy
Topics: IoT and Edge/Fog Computing
Methods: Pruning
