Queueing-Aware Optimization of Reasoning Tokens for Accuracy-Latency Trade-offs in LLM Servers

Emre Ozbas; Melih Bastopcu

arXiv:2601.10274 · cs.LG · January 16, 2026

TL;DR

This paper develops an optimization framework for allocating internal thinking tokens across task types in a single LLM server, balancing accuracy against latency while keeping the queue stable.

Contribution

It introduces a novel concave optimization model for token allocation in LLM servers, with a fixed-point characterization and convergence guarantees for the solution.

Findings

01

The optimal token allocation balances response accuracy against mean system time.

02

The proposed method guarantees queue stability and convergence.

03

Simulation shows effective performance of the allocation strategy.

Abstract

We consider a single large language model (LLM) server that serves a heterogeneous stream of queries belonging to $N$ distinct task types. Queries arrive according to a Poisson process, and each type occurs with a known prior probability. For each task type, the server allocates a fixed number of internal thinking tokens, which determines the computational effort devoted to that query. The token allocation induces an accuracy-latency trade-off: the service time follows an approximately affine function of the allocated tokens, while the probability of a correct response exhibits diminishing returns. Under a first-in, first-out (FIFO) service discipline, the system operates as an $M/G/1$ queue, and the mean system time depends on the first and second moments of the resulting service-time distribution. We formulate a constrained optimization problem that maximizes a weighted average…
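The dependence of mean system time on the first two service-time moments can be illustrated with the standard Pollaczek-Khinchine formula for an $M/G/1$ FIFO queue, $E[T] = E[S] + \lambda E[S^2] / (2(1 - \lambda E[S]))$. The sketch below is not the paper's implementation. It assumes, for illustration only, that a type-$k$ query has a deterministic service time `a * tokens[k] + b`, with hypothetical coefficients `a` and `b` shared by all task types; the function and parameter names are likewise made up for this example.

```python
from typing import Sequence


def mean_system_time(
    lam: float,
    priors: Sequence[float],
    tokens: Sequence[float],
    a: float,
    b: float,
) -> float:
    """Mean system time of an M/G/1 FIFO queue (Pollaczek-Khinchine).

    Assumptions (not taken from the paper): the service time of a
    type-k query is deterministic and equal to a * tokens[k] + b,
    with the same hypothetical coefficients a, b for every type.
    """
    # Per-type service times under the assumed affine model.
    service = [a * t + b for t in tokens]

    # First and second moments of the mixture over task types.
    m1 = sum(p * s for p, s in zip(priors, service))
    m2 = sum(p * s * s for p, s in zip(priors, service))

    # Queue stability requires utilization rho = lam * E[S] < 1.
    rho = lam * m1
    if rho >= 1.0:
        raise ValueError("unstable queue: lam * E[S] must be below 1")

    # Mean service time plus mean waiting time in the queue.
    return m1 + lam * m2 / (2.0 * (1.0 - rho))
```

With a single task type whose service time is 1 and an arrival rate of 0.5, the function returns 1.5: one unit of service plus half a unit of waiting. Raising any entry of `tokens` increases both moments, so latency grows faster than linearly as utilization approaches 1, which is why a stability condition has to accompany any accuracy objective.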

Taxonomy

Topics: IoT and Edge/Fog Computing · Cloud Computing and Resource Management · Distributed systems and fault tolerance