Tokenized Bandit for LLM Decoding and Alignment

Suho Shin; Chenghao Yang; Haifeng Xu; Mohammad T. Hajiaghayi

arXiv:2506.07276·cs.LG·June 10, 2025

Abstract

We introduce the tokenized linear bandit (TLB) and the tokenized multi-armed bandit (TMAB), variants of the linear and stochastic multi-armed bandit problems inspired by LLM decoding and alignment. In these problems, at each round $t \in [T]$, a user submits a query (context), and the decision maker (DM) sequentially and irrevocably selects tokens from a token set. Once the sequence is complete, the DM observes a random utility from the user, whose expectation is represented by a sequence function mapping the chosen token sequence to a nonnegative real value that depends on the query. In both problems, we first show that learning is impossible without any structure on the sequence function. We introduce a natural assumption, diminishing distance with more commons (DDMC), and propose algorithms with regret $\tilde{O}(L\sqrt{T})$ and $\tilde{O}(L T^{2/3})$ for TLB and TMAB, respectively. As a side…
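
The interaction protocol described in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's algorithm: the token set `TOKENS`, the fixed length `MAX_LEN`, the toy `sequence_value` function, the Bernoulli utility, and both policies are hypothetical choices made for illustration. The regret computed here is the usual expected regret against the best sequence per query, which is also an assumption, since the truncated abstract does not spell out the definition.

```python
import itertools
import random

TOKENS = ["a", "b", "c"]  # hypothetical token set
MAX_LEN = 3               # hypothetical fixed sequence length


def sequence_value(query, seq):
    """Hypothetical sequence function: a nonnegative expected utility.

    Fraction of positions where the chosen sequence matches the query.
    """
    matches = sum(1 for q, s in zip(query, seq) if q == s)
    return matches / MAX_LEN


def observe_utility(query, seq, rng):
    """Random utility (Bernoulli) whose expectation is sequence_value."""
    return 1.0 if rng.random() < sequence_value(query, seq) else 0.0


def best_value(query):
    """Best achievable expected utility, found by brute force."""
    candidates = itertools.product(TOKENS, repeat=MAX_LEN)
    return max(sequence_value(query, seq) for seq in candidates)


def random_policy(query, prefix, rng):
    """Baseline: pick the next token uniformly at random."""
    return rng.choice(TOKENS)


def oracle_policy(query, prefix, rng):
    """Knows the toy sequence function: copy the query token by token."""
    return query[len(prefix)]


def run(policy, num_rounds, seed=0):
    """Simulate the protocol and return cumulative expected regret."""
    rng = random.Random(seed)
    regret = 0.0
    for _ in range(num_rounds):
        # The user submits a query (context).
        query = tuple(rng.choice(TOKENS) for _ in range(MAX_LEN))
        # The DM picks tokens one at a time; earlier picks are never revised.
        seq = []
        for _ in range(MAX_LEN):
            seq.append(policy(query, tuple(seq), rng))
        seq = tuple(seq)
        # Feedback arrives only once the whole sequence is complete.
        # A learning DM would update its estimates from this value.
        _utility = observe_utility(query, seq, rng)
        regret += best_value(query) - sequence_value(query, seq)
    return regret
```

Neither policy here learns from the observed utility; the sketch only shows the order of events in a round (query, token-by-token selection, utility after completion) and how regret is accumulated. Calling `run(random_policy, 1000)` and `run(oracle_policy, 1000)` contrasts an uninformed DM with one that already knows the sequence function.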

Videos

Tokenized Bandit for LLM Decoding and Alignment · SlidesLive

Taxonomy

Topics: Advanced Bandit Algorithms Research · Machine Learning and Algorithms · Mobile Crowdsensing and Crowdsourcing