Large Language Model-Enhanced Multi-Armed Bandits
Jiahang Sun, Zhiyong Wang, Runhan Yang, Chenjun Xiao, John C.S. Lui, Zhongxiang Dai

TL;DR
This paper introduces a hybrid approach combining classical multi-armed bandit algorithms with large language models to improve decision-making, especially in complex tasks where direct LLM-based arm selection is suboptimal.
Contribution
The paper proposes integrating LLM-based reward prediction into classical MAB algorithms, including Thompson sampling and regression-based methods, with extensions to dueling bandits, outperforming previous direct LLM methods.
Findings
The proposed algorithms outperform baseline methods on both synthetic and real-world text datasets.
The hybrid approach excels in tasks with less semantic clarity, outperforming direct LLM arm selection.
Incorporating LLMs with classical algorithms improves exploration-exploitation balance.
Abstract
Large language models (LLMs) have been adopted to solve sequential decision-making tasks such as multi-armed bandits (MAB), in which an LLM is directly instructed to select the arms to pull in every iteration. However, this paradigm of direct arm selection using LLMs has been shown to be suboptimal in many MAB tasks. Therefore, we propose an alternative approach which combines the strengths of classical MAB and LLMs. Specifically, we adopt a classical MAB algorithm as the high-level framework and leverage the strong in-context learning capability of LLMs to perform the sub-task of reward prediction. Firstly, we incorporate the LLM-based reward predictor into the classical Thompson sampling (TS) algorithm and adopt a decaying schedule for the LLM temperature to ensure a transition from exploration to exploitation. Next, we incorporate the LLM-based reward predictor (with a temperature of…
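The TS-style loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `stub_llm_predict` is a hypothetical stand-in for the LLM reward predictor (it returns the empirical mean of an arm plus temperature-scaled noise), and the geometric `temperature_schedule` is an assumed schedule, since the excerpt does not give the exact decay rule.

```python
import random

def temperature_schedule(t, t0=1.0, decay=0.95):
    # Assumed geometric decay; the paper's actual schedule is not given in this excerpt.
    return t0 * decay ** t

def stub_llm_predict(history, arm, temperature, rng):
    # Hypothetical stand-in for an LLM call with in-context history:
    # empirical mean of the arm's rewards plus noise scaled by the temperature.
    rewards = [r for a, r in history if a == arm]
    mean = sum(rewards) / len(rewards) if rewards else 0.5
    return mean + temperature * rng.gauss(0.0, 1.0)

def ts_llm(true_means, horizon, seed=0):
    # High-level bandit loop: sample one predicted reward per arm, pull the argmax.
    rng = random.Random(seed)
    history = []  # list of (arm, reward) pairs fed back to the predictor
    k = len(true_means)
    for t in range(horizon):
        temp = temperature_schedule(t)
        preds = [stub_llm_predict(history, a, temp, rng) for a in range(k)]
        arm = max(range(k), key=preds.__getitem__)
        # Simulated Bernoulli environment, used only for this illustration.
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        history.append((arm, reward))
    return history

history = ts_llm([0.2, 0.8], horizon=200)
print(sum(r for _, r in history) / len(history))
```

Early on, the high temperature makes the predictions noisy, so different arms get pulled. As the temperature decays, the predictions concentrate and the loop increasingly exploits the arm with the highest predicted reward.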
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The paper studies the application of LLMs in classical bandit problems, which is an interesting adaptation. The proposed methods borrowed ideas from bandits research (i.e., Thompson sampling and SquareCB), as scaffolding to facilitate high-level decision-making, while LLMs perform low-level predictions. From this perspective, the designs are theoretically grounded. Some experiments are conducted, showing that with the proposed scaffolding, the LLM + bandits approach is performing better than
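For readers unfamiliar with SquareCB, its arm-selection rule (inverse gap weighting) can be sketched as below. This shows the standard rule from the bandit literature, assuming `preds` holds one predicted reward per arm; how the paper obtains these predictions from the LLM or sets the parameter `gamma` is not stated in this excerpt, so both are placeholders.

```python
def inverse_gap_weighting(preds, gamma):
    # Standard SquareCB sampling distribution over arms.
    # preds: predicted reward per arm (placeholder for LLM-based predictions).
    # gamma: exploration parameter; larger values mean greedier selection.
    k = len(preds)
    best = max(range(k), key=preds.__getitem__)
    probs = [0.0] * k
    for a in range(k):
        if a != best:
            # Probability shrinks as the predicted gap to the best arm grows.
            probs[a] = 1.0 / (k + gamma * (preds[best] - preds[a]))
    # The remaining probability mass goes to the arm with the best prediction.
    probs[best] = 1.0 - sum(probs)
    return probs

print(inverse_gap_weighting([0.9, 0.5, 0.1], gamma=10.0))
```

With `gamma = 0` the distribution is uniform over arms. As `gamma` grows, mass moves toward the arm with the highest predicted reward.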
- The overall message (or purpose) of this paper is not very clear. If the target is to provide a scaffolding for LLMs in decision-making tasks, more experiments should be performed (especially in real-world scenarios, and perhaps not only in bandit settings). It seems that the current target is only to help LLMs play bandits, which is a very narrow topic and does not hold sufficient interest from my perspective.
- If the goal is only to have LLMs play bandits, there should also be more approa
+ It’s interesting to see an algorithm that integrates an LLM-based reward predictor into the classical Thompson sampling framework, which indeed manages the exploration-exploitation trade-off by starting with a high LLM temperature and gradually decaying it to promote exploitation as more data is gathered.
+ To my knowledge, the design of TS-LLM-DB adaptation is novel, which predicts pairwise preferences, approximates the Borda score to pick the first arm, and then selects a second arm to bala
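The Borda-score step mentioned in this review can be illustrated with a short sketch. It assumes a matrix `pref` in which `pref[i][j]` is the predicted probability that arm `i` is preferred over arm `j`; in the paper these values would come from the LLM, and its exact approximation and second-arm rule are not given in this excerpt, so only the standard Borda definition and the first-arm choice are shown.

```python
def borda_scores(pref):
    # Borda score of arm i: average probability that i beats a uniformly
    # chosen other arm j (standard definition in the dueling-bandit literature).
    k = len(pref)
    return [
        sum(pref[i][j] for j in range(k) if j != i) / (k - 1)
        for i in range(k)
    ]

def pick_first_arm(pref):
    # First arm of the duel: the arm with the highest Borda score.
    scores = borda_scores(pref)
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical predicted preference matrix for three arms.
pref = [
    [0.5, 0.7, 0.9],
    [0.3, 0.5, 0.6],
    [0.1, 0.4, 0.5],
]
print(borda_scores(pref), pick_first_arm(pref))
```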
- My first concern is the computational cost. The proposed methods (TS-LLM and RO-LLM) seem to require $K$ (the number of arms) separate LLM calls per iteration to get a predicted reward for each arm. The dueling bandit version (TS-LLM-DB) is even more expensive, requiring $(K \times N) + K$ LLM calls per iteration. I wonder if this would be prohibitive for many real-world applications.
- In addition, it seems that the algorithms feed the entire history of observations into the LLM prompt for in-con
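The call counts quoted in this review can be turned into a quick cost estimate. The sketch below uses only the reviewer's expressions ($K$ calls per iteration for TS-LLM and RO-LLM, $(K \times N) + K$ for TS-LLM-DB). The meaning of $N$ is not defined in this excerpt, so it is passed through as an opaque parameter, and the numbers in the usage lines are hypothetical.

```python
def llm_calls_per_iteration(k, n=None):
    # k: number of arms. n: the reviewer's N (meaning not defined in this excerpt).
    # Without n: one call per arm (TS-LLM / RO-LLM as described by the reviewer).
    # With n: the reviewer's dueling-bandit count (K * N) + K for TS-LLM-DB.
    if n is None:
        return k
    return k * n + k

def total_llm_calls(k, horizon, n=None):
    # Total calls over a run of `horizon` iterations.
    return horizon * llm_calls_per_iteration(k, n)

# Hypothetical setting: 10 arms, 100 iterations, N = 5.
print(total_llm_calls(10, 100))       # 1000
print(total_llm_calls(10, 100, n=5))  # 6000
```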
1. The studied problem, applying LLMs in sequential decision-making tasks, is very interesting.
2. The authors propose a Thompson sampling (TS)-based algorithm which incorporates the LLM-based reward predictor into the classical TS algorithm, and further extend the proposed TS-based algorithm to dueling bandits which only use the preference feedback between pairs of arms.
3. This paper provides extensive experiments to demonstrate the empirical advantage of the proposed algorithms compared to th
1. There is no theoretical analysis provided in this paper.
2. This paper looks more like an experimental report of applying LLMs to the MAB tasks. The innovation in algorithm design and theoretical analysis is limited. In addition, the significance of the proposed algorithms is also not very clear, since they seem to be designed only for the application of LLMs in MAB tasks (which is an interesting but very particular problem) and compared only with direct arm selection using LLMs. Can the prop
- The paper proposes an intuitive paradigm that integrates classical MAB exploration mechanisms with LLM-based reward prediction.
- The experiments cover diverse settings and demonstrate consistent gains over baselines across multiple LLMs, enhancing the generality of the findings.
- The approach is plug-and-play, requiring no LLM fine-tuning, and the paper offers insights that are directly applicable to practitioners using black-box LLM APIs.
- Although the paper repeatedly motivates the use of Thompson Sampling and SquareCB, it does not provide any regret or convergence analysis, both of which are fundamental in the bandit literature. Even a simplified approximation or asymptotic argument would meaningfully strengthen the claim of a “principled integration.”
- The evaluation primarily compares against prompt-engineering variants from Krishnamurthy et al. (2024), omitting many strong and relevant bandit methods as well as transfor
Taxonomy
Topics: COVID-19 diagnosis using AI · Machine Learning in Healthcare · Sentiment Analysis and Opinion Mining
