Collaborative Processing for Multi-Tenant Inference on Memory-Constrained Edge TPUs
Nathan Ng, Walid A. Hanafy, Prashanthi Kadambi, Balachandra Sunil, Ayush Gupta, David Irwin, Yogesh Simmhan, Prashant Shenoy

TL;DR
SwapLess is an adaptive system for multi-tenant inference on memory-limited Edge TPUs. It dynamically partitions model execution between the TPU and the CPU to cut swapping overhead and end-to-end latency.
Contribution
It introduces an analytic queueing model and an online adjustment mechanism for efficient, multi-tenant TPU-CPU collaborative inference on constrained edge devices.
Findings
Reduces mean latency by up to 63.8% for single-tenant workloads.
Achieves up to 77.4% latency reduction for multi-tenant workloads.
Demonstrates effectiveness on Edge TPU platforms.
Abstract
IoT applications increasingly rely on on-device AI accelerators to ensure high performance, especially in low-connectivity and safety-critical scenarios. However, the limited on-chip memory of these accelerators forces inference runtimes to swap model segments between host and accelerator memory, incurring significant swapping overheads. While collaborative processing by partitioning model execution across CPU and accelerator resources can reduce accelerator memory pressure and execution overhead, naive partitioning may worsen end-to-end latency by either shifting excessive computation to the CPU or failing to sufficiently reduce swapping, a problem that is further exacerbated in multi-tenant and dynamic environments. To address these issues, we present SwapLess, a system for adaptive, multi-tenant TPU-CPU collaborative inference on memory-constrained Edge TPUs. SwapLess utilizes an…
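The partitioning trade-off described in the abstract can be illustrated with a toy cost model. This is a minimal sketch under stated assumptions, not SwapLess's actual method: the `Layer` fields, the linear swap cost, and the function names `estimate_latency` and `best_split` are all hypothetical, and the sketch ignores queueing and multi-tenancy entirely. It only shows why a split that moves too much work to the CPU, or too little off the accelerator, can both lose to an intermediate split.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Layer:
    tpu_ms: float   # assumed per-layer compute time on the accelerator
    cpu_ms: float   # assumed per-layer compute time on the CPU
    size_mb: float  # assumed parameter footprint of the layer


def estimate_latency(layers: List[Layer], split: int,
                     capacity_mb: float, swap_ms_per_mb: float) -> float:
    """Latency if layers[:split] run on the TPU and layers[split:] on the CPU.

    Assumption: any TPU-side footprint beyond on-chip capacity is swapped
    in on every inference at a fixed cost per megabyte.
    """
    tpu_part, cpu_part = layers[:split], layers[split:]
    footprint = sum(layer.size_mb for layer in tpu_part)
    overflow = max(0.0, footprint - capacity_mb)
    tpu_time = sum(layer.tpu_ms for layer in tpu_part)
    cpu_time = sum(layer.cpu_ms for layer in cpu_part)
    return tpu_time + overflow * swap_ms_per_mb + cpu_time


def best_split(layers: List[Layer], capacity_mb: float,
               swap_ms_per_mb: float) -> int:
    """Exhaustively pick the split index with the lowest estimated latency."""
    return min(
        range(len(layers) + 1),
        key=lambda k: estimate_latency(layers, k, capacity_mb, swap_ms_per_mb),
    )
```

With made-up numbers where the whole model overflows on-chip memory, running everything on the TPU pays the swap cost, running everything on the CPU pays the slower compute, and the sketch selects a split in between. A real system would also have to account for contention among tenants and changing request rates, which this toy model leaves out.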
