Token Management in Multi-Tenant AI Inference Platforms

William J. Cunningham

arXiv:2603.00356 · cs.DC · March 3, 2026

Open Access

TL;DR

This paper introduces token pools, a resource management abstraction for multi-tenant AI inference platforms that improves resource utilization, service-level guarantees, and fairness without modifying inference runtimes or schedulers.

Contribution

The paper proposes token pools as a new control-plane abstraction that enables explicit capacity management, fine-grained control, and priority-aware resource allocation in multi-tenant AI inference systems.

Findings

01. Token pools maintain bounded latency during overload conditions.

02. They enable debt-based fair-share convergence among elastic workloads.

03. Experiments show improved resource utilization and fairness.
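
The summary above does not say how the debt-based mechanism works, so the following is only a generic illustration of debt-based fair sharing, not the paper's algorithm. It assumes a hypothetical scheduler that serves a fixed batch of tokens per round, tracks each tenant's debt as tokens served minus its weighted fair share, and always serves the tenant with the lowest debt. The names `pick_next`, `simulate`, `weights`, and `tokens_per_round` are invented for this sketch.

```python
def pick_next(debt):
    """Return the tenant with the lowest debt (the most under-served so far)."""
    return min(debt, key=debt.get)


def simulate(weights, rounds, tokens_per_round=100):
    """Serve one tenant per round; debt = tokens served - weighted fair share.

    `weights` maps a tenant name to its (assumed) fair-share weight.
    """
    total_weight = sum(weights.values())
    debt = {tenant: 0.0 for tenant in weights}
    served = {tenant: 0 for tenant in weights}
    for _ in range(rounds):
        tenant = pick_next(debt)
        served[tenant] += tokens_per_round
        for t, w in weights.items():
            # Every tenant is owed its weighted share of this round's tokens.
            debt[t] -= tokens_per_round * w / total_weight
        # Only the tenant that was actually served accrues usage.
        debt[tenant] += tokens_per_round
    return served


# Two always-backlogged tenants with a 3:1 weight ratio.
served = simulate({"a": 3, "b": 1}, rounds=400)
share_a = served["a"] / (served["a"] + served["b"])
```

Because the lowest-debt tenant is always served next, the debts stay bounded and the served shares track the weights, which is the sense of "convergence" this sketch illustrates.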

Abstract

Multi-tenant AI inference platforms must balance resource utilization against service-level guarantees under variable demand. Conventional approaches fail to achieve this balance: dedicated endpoints strand capacity on idle models, while rate limits ignore the heterogeneous cost of inference requests. We introduce token pools, a control-plane abstraction that represents inference capacity as explicit entitlements expressed in inference-native units (token throughput, KV cache, concurrency). Unlike rate limits, which govern request admission without regard to execution cost, token pools authorize both admission and autoscaling from the same capacity model, ensuring consistency between what is promised and what is provisioned. The abstraction captures burst modes across multiple dimensions invisible to conventional throttling. Dynamic per-entitlement limits on each burst dimension…
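
To make the contrast with rate limits concrete, here is a minimal sketch of an entitlement expressed in the three units the abstract names (token throughput, KV cache, concurrency). It is an illustration under stated assumptions, not the paper's implementation: the class names `Entitlement` and `Usage`, the functions `admit` and `required_capacity`, and all field names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entitlement:
    """Hypothetical per-tenant capacity grant in inference-native units."""
    tokens_per_second: float  # sustained token throughput
    kv_cache_tokens: int      # KV cache budget, counted in cached tokens
    max_concurrency: int      # simultaneous in-flight requests


@dataclass
class Usage:
    """Hypothetical snapshot of what a tenant is consuming right now."""
    tokens_per_second: float = 0.0
    kv_cache_tokens: int = 0
    in_flight: int = 0


def admit(ent, usage, est_tokens_per_second, est_kv_tokens):
    """Admit a request only if every dimension stays within the entitlement."""
    return (
        usage.tokens_per_second + est_tokens_per_second <= ent.tokens_per_second
        and usage.kv_cache_tokens + est_kv_tokens <= ent.kv_cache_tokens
        and usage.in_flight + 1 <= ent.max_concurrency
    )


def required_capacity(entitlements):
    """Sum entitlements so provisioning targets the numbers admission enforces."""
    return Entitlement(
        tokens_per_second=sum(e.tokens_per_second for e in entitlements),
        kv_cache_tokens=sum(e.kv_cache_tokens for e in entitlements),
        max_concurrency=sum(e.max_concurrency for e in entitlements),
    )


ent = Entitlement(tokens_per_second=1000.0, kv_cache_tokens=8000, max_concurrency=4)
usage = Usage(tokens_per_second=900.0, kv_cache_tokens=2000, in_flight=1)

# Same request count, very different cost: a short prompt fits, while a
# long-context prompt would exceed the KV cache budget and is refused.
small = admit(ent, usage, est_tokens_per_second=50.0, est_kv_tokens=500)
long_ctx = admit(ent, usage, est_tokens_per_second=50.0, est_kv_tokens=7000)
```

A requests-per-second limit would treat the two requests identically. In this sketch, admission and `required_capacity` read the same `Entitlement` values, which mirrors the abstract's point that admission and autoscaling share one capacity model.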

Taxonomy

Topics: Cloud Computing and Resource Management · Software System Performance and Reliability · Software-Defined Networks and 5G