TL;DR
This paper introduces a new framework for scheduling in queueing systems that learns service rates using contextual information, providing regret bounds and algorithms for both stochastic and adversarial settings.
Contribution
It develops the concept of contextual queueing bandits, proposes novel algorithms with proven regret bounds, and introduces a queue length regret decomposition framework for analysis.
Findings
CQB-ε achieves a regret upper bound of O(T^{-1/4})
CQB-Opt achieves a regret upper bound of O(\log^2 T) in adversarial contexts
Experimental results validate the theoretical regret bounds
Abstract
We introduce contextual queueing bandits, a new context-aware framework for scheduling while simultaneously learning unknown service rates. Individual jobs carry heterogeneous contextual features, based on which the agent chooses a job and matches it with a server to maximize the departure rate. The service/departure rate is governed by a logistic model of the contextual feature with an unknown server-specific parameter. To evaluate the performance of a policy, we consider queue length regret, defined as the difference in queue length between the policy and the optimal policy. The main challenge in the analysis is that the lists of remaining job features in the queue may differ under our policy versus the optimal policy for a given time step, since they may process jobs in different orders. To address this, we propose the idea of policy-switching queues equipped with a sophisticated…
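The logistic service model described in the abstract can be illustrated with a minimal sketch. This is not CQB-ε or CQB-Opt: it assumes the server parameters are already known (an oracle), whereas the paper's algorithms must learn them. The names `service_rate`, `best_match`, `step`, and the example feature vectors are hypothetical and chosen only for illustration.

```python
import math
import random


def service_rate(x, theta):
    """Logistic departure probability for job feature x on a server with parameter theta."""
    z = sum(xi * ti for xi, ti in zip(x, theta))
    return 1.0 / (1.0 + math.exp(-z))


def best_match(queue, thetas):
    """Return (job_index, server_index, rate) for the pair with the highest rate."""
    best = None
    for j, x in enumerate(queue):
        for k, theta in enumerate(thetas):
            r = service_rate(x, theta)
            if best is None or r > best[2]:
                best = (j, k, r)
    return best


def step(queue, thetas, rng):
    """Serve one matched job; it departs with probability equal to its service rate."""
    if not queue:
        return queue, False
    j, _, r = best_match(queue, thetas)
    departed = rng.random() < r
    if departed:
        # Remove the served job; all other jobs stay in the queue.
        queue = queue[:j] + queue[j + 1:]
    return queue, departed


# Assumed toy instance: three queued jobs, two servers, 2-dimensional features.
queue = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
thetas = [(2.0, 0.0), (0.0, -1.0)]
rng = random.Random(0)
queue, departed = step(queue, thetas, rng)
```

Because the oracle and a learning policy may serve jobs in different orders, the features remaining in their queues can diverge after a single step. The abstract names this mismatch as the main analytical challenge.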
Peer Reviews
Decision · ICLR 2026 Poster
The strengths of the paper are summarized below.
- The paper introduces a novel contextual queueing bandit framework in which arriving jobs have distinct contextual features. The problem is well-motivated by practical applications such as personalized recommendation systems and LLM inference workloads.
- The paper addresses the analytical challenges arising from queue state mismatch, which complicates the regret analysis in queueing-based decision processes.
- The work extends beyond the stoch
The weaknesses of the paper are given below.
- The main contribution lies in extending existing queueing bandit frameworks to the heterogeneous contextual setting, but many of the key algorithmic and analytical techniques are adapted from prior work rather than fundamentally novel.
- The algorithmic design of the proposed methods follows standard extensions of existing bandit algorithms. While the regret analysis is mathematically interesting, it does not lead to new insights in algorithmic d
1. The studied problem, contextual queueing bandits, is interesting and novel, which can be applied to various scenarios such as job scheduling and wireless networks. This problem may attract attention in both online learning and queueing theory communities.
2. The authors design two algorithms, i.e., CQB-$\varepsilon$, which achieves a regret upper bound of $O(T^{-1/4})$ for the regular contextual setting, and CQB-Opt for the adversarially chosen context setting, which achieves a regret upper
1. This paper should discuss and compare with the following work, which studies RL with queueing states, in formulation and results. For example, can the formulation of the following work encompass the formulation of this paper? Yashaswini Murthy, Isaac Grosof, Siva Theja Maguluri, and R. Srikant. Performance of NPG in countable state-space average-cost RL. arXiv preprint arXiv:2405.20467 (2024).
2. The proposed algorithm, CQB-$\varepsilon$, uses the explore-then-exploit (ETE) strategy, which
1. The paper extends classical queueing bandits to a contextual setting, where job features directly affect service rates. The metric of queue length regret is a non-trivial and operationally meaningful objective compared to standard cumulative reward regret.
2. The writing is good. The queue dynamics, logistic service model, and filtration definitions are clearly presented. The results for both stochastic and adversarial contexts are also clearly separated.
3. The policy-switching queue cons
1. The logistic service rate model needs to be further examined and discussed.
2. Although the coupling-based decomposition is theoretically interesting, the operational meaning of the $O(T^{-1/4})$ scaling could be clarified. For example, what does vanishing queue length difference imply in real scheduling terms?
3. The length setup for the pure exploration phase requires prior knowledge of the traffic slackness parameter. It would be valuable to discuss whether this dependence can be relaxe
Taxonomy
Topics: Advanced Bandit Algorithms Research · Age of Information Optimization · Caching and Content Delivery
