Queueing-Aware Optimization of Reasoning Tokens for Accuracy-Latency Trade-offs in LLM Servers
Emre Ozbas, Melih Bastopcu

TL;DR
This paper develops an optimization framework for allocating internal tokens in an LLM server to balance accuracy and latency, ensuring queue stability and maximizing overall performance.
Contribution
It introduces a novel concave optimization model for token allocation in LLM servers, with a fixed-point characterization and convergence guarantees for the solution.
Findings
The optimal token allocation maximizes the combined accuracy-latency objective.
The proposed method guarantees queue stability and convergence.
Simulation shows effective performance of the allocation strategy.
Abstract
We consider a single large language model (LLM) server that serves a heterogeneous stream of queries belonging to distinct task types. Queries arrive according to a Poisson process, and each type occurs with a known prior probability. For each task type, the server allocates a fixed number of internal thinking tokens, which determines the computational effort devoted to that query. The token allocation induces an accuracy-latency trade-off: the service time follows an approximately affine function of the allocated tokens, while the probability of a correct response exhibits diminishing returns. Under a first-in, first-out (FIFO) service discipline, the system operates as an M/G/1 queue, and the mean system time depends on the first and second moments of the resulting service-time distribution. We formulate a constrained optimization problem that maximizes a weighted average…
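The dependence of mean system time on the first two service-time moments can be illustrated with the standard Pollaczek-Khinchine formula for a FIFO queue with Poisson arrivals. The sketch below is not the authors' code. It assumes an exactly affine, deterministic service time per task type, and the names `per_token` and `overhead` and their default values are hypothetical placeholders, not parameters taken from the paper.

```python
from typing import Sequence


def mean_system_time(
    lam: float,
    priors: Sequence[float],
    tokens: Sequence[float],
    per_token: float = 0.01,   # hypothetical seconds per thinking token
    overhead: float = 0.1,     # hypothetical fixed seconds per query
) -> float:
    """Mean time in system (waiting + service) via Pollaczek-Khinchine.

    Assumption: a query of type k has deterministic service time
    per_token * tokens[k] + overhead, so the overall service time is a
    discrete mixture weighted by the type priors.
    """
    service = [per_token * n + overhead for n in tokens]
    m1 = sum(p * s for p, s in zip(priors, service))      # E[S]
    m2 = sum(p * s * s for p, s in zip(priors, service))  # E[S^2]
    rho = lam * m1                                        # server utilization
    if rho >= 1.0:
        raise ValueError("unstable queue: lam * E[S] must be below 1")
    return m1 + lam * m2 / (2.0 * (1.0 - rho))


# Two equally likely task types given 10 and 30 thinking tokens.
t = mean_system_time(lam=1.0, priors=[0.5, 0.5], tokens=[10, 30])
```

Raising any type's token count increases both moments, so latency grows faster than linearly as utilization approaches 1. That is why the stability condition acts as a constraint on the allocation.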
Taxonomy
Topics: IoT and Edge/Fog Computing · Cloud Computing and Resource Management · Distributed systems and fault tolerance
