Beyond Numeric Rewards: In-Context Dueling Bandits with LLM Agents

Fanzeng Xia; Hao Liu; Yisong Yue; Tongxin Li

arXiv:2407.01887 · cs.LG · June 10, 2025

TL;DR

This paper explores the zero-shot decision-making capabilities of Large Language Models in dueling bandits, introduces an agentic framework called LEAD to improve performance, and provides theoretical and empirical validation of its effectiveness.

Contribution

The paper demonstrates LLMs' potential for cross-domain in-context reinforcement learning in dueling bandits and proposes LEAD, a framework that integrates classic dueling bandit algorithms with LLMs to improve regret performance.

Findings

01 Top-performing LLMs exhibit low short-term weak regret in dueling bandits without task-specific training.

02 The LEAD framework inherits theoretical guarantees from classic dueling bandit algorithms.

03 LEAD is robust to noisy and adversarial prompts.

Abstract

In-Context Reinforcement Learning (ICRL) is a frontier paradigm to solve Reinforcement Learning (RL) problems in the foundation model era. While ICRL capabilities have been demonstrated in transformers through task-specific training, the potential of Large Language Models (LLMs) out-of-the-box remains largely unexplored. This paper investigates whether LLMs can generalize cross-domain to perform ICRL under the problem of Dueling Bandits (DB), a stateless preference-based RL setting. We find that the top-performing LLMs exhibit a notable zero-shot capacity for relative decision-making, which translates to low short-term weak regret across all DB environment instances by quickly including the best arm in duels. However, an optimality gap still exists between LLMs and classic DB algorithms in terms of strong regret. LLMs struggle to converge and consistently exploit even when explicitly…
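The abstract's distinction between weak and strong regret can be illustrated with a minimal sketch. It assumes a Condorcet winner exists and uses one common convention: per round, strong regret adds the best arm's preference gaps over both dueling arms, while weak regret takes the smaller of the two gaps (some texts halve the strong-regret sum). The preference matrix, the duel sequence, and the function names are hypothetical and not taken from the paper.

```python
from typing import List, Sequence, Tuple


def condorcet_winner(pref: Sequence[Sequence[float]]) -> int:
    """Return the arm that beats every other arm with probability > 0.5."""
    n = len(pref)
    for i in range(n):
        if all(pref[i][j] > 0.5 for j in range(n) if j != i):
            return i
    raise ValueError("no Condorcet winner in this preference matrix")


def cumulative_regret(
    pref: Sequence[Sequence[float]],
    duels: List[Tuple[int, int]],
) -> Tuple[float, float]:
    """Return (strong, weak) cumulative regret for a sequence of duels.

    pref[i][j] is the probability that arm i is preferred over arm j.
    """
    best = condorcet_winner(pref)
    strong = 0.0
    weak = 0.0
    for arm_a, arm_b in duels:
        gap_a = pref[best][arm_a] - 0.5  # zero when arm_a is the best arm
        gap_b = pref[best][arm_b] - 0.5
        strong += gap_a + gap_b          # zero only if both arms are the best arm
        weak += min(gap_a, gap_b)        # zero as soon as the best arm is in the duel
    return strong, weak


# Hypothetical 3-arm instance in which arm 0 is the Condorcet winner.
EXAMPLE_PREF = [
    [0.5, 0.7, 0.9],
    [0.3, 0.5, 0.6],
    [0.1, 0.4, 0.5],
]
EXAMPLE_DUELS = [(0, 1), (0, 2), (1, 2), (0, 0)]

if __name__ == "__main__":
    strong, weak = cumulative_regret(EXAMPLE_PREF, EXAMPLE_DUELS)
    print(f"strong regret: {strong:.2f}, weak regret: {weak:.2f}")
```

Under this convention, an agent that quickly includes the best arm in its duels accumulates almost no weak regret, yet it keeps accumulating strong regret until it duels the best arm against itself. That is the gap the abstract describes between LLMs and classic dueling bandit algorithms.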

Taxonomy

Topics: Auction Theory and Applications · Blockchain Technology Applications and Security · Imbalanced Data Classification Techniques

Methods: Attention Is All You Need · LLaMA · Cosine Annealing · Linear Layer · Label Smoothing · Absolute Position Encodings · Position-Wise Feed-Forward Layer