Improving Reward-Conditioned Policies for Multi-Armed Bandits using Normalized Weight Functions

Kai Xu; Farid Tajaddodianfar; Ben Allison

arXiv:2406.10795 · cs.LG · June 18, 2024

Open Access

TL;DR

This paper enhances reward-conditioned policies for multi-armed bandits by introducing normalized weight functions, improving convergence speed and expected rewards, especially in large action spaces and sparse reward scenarios.

Contribution

The paper proposes generalized marginalization, a technique that constructs policies using normalized weight functions, to improve the performance of reward-conditioned policies (RCPs) in multi-armed bandit problems.

Findings

01. Normalized weight functions improve the convergence speed and expected reward of RCPs.

02. RCPs perform competitively against classic methods such as UCB and Thompson sampling.

03. The approach is effective in large action spaces and sparse-reward environments.

Abstract

Recently proposed reward-conditioned policies (RCPs) offer an appealing alternative in reinforcement learning. Compared with policy gradient methods, policy learning in RCPs is simpler since it is based on supervised learning, and unlike value-based methods, it does not require optimization in the action space to take actions. However, for multi-armed bandit (MAB) problems, we find that RCPs are slower to converge and have inferior expected rewards at convergence, compared with classic methods such as the upper confidence bound (UCB) and Thompson sampling. In this work, we show that the performance of RCPs can be enhanced by constructing policies through the marginalization of rewards using normalized weight functions, whose sum or integral equals $1$, although the function values may be negative. We refer to this technique as generalized marginalization, whose advantage is that negative…
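To make the idea of marginalizing with a normalized weight function concrete, the following is a minimal sketch and not the authors' implementation. It assumes a tabular bandit with binary rewards, a reward-conditioned policy p(a | r) estimated from smoothed counts, and a final step that clips negative values to zero and renormalizes; that last step, the function names `conditional_policy` and `generalized_marginal_policy`, and the example counts and weights are all illustrative assumptions rather than details taken from the paper.

```python
import numpy as np


def conditional_policy(counts, smoothing=1.0):
    """Tabular estimate of p(a | r) from counts[r, a] with additive smoothing.

    Hypothetical helper: each row is one reward value, each column one arm.
    """
    smoothed = np.asarray(counts, dtype=float) + smoothing
    return smoothed / smoothed.sum(axis=1, keepdims=True)


def generalized_marginal_policy(cond, weights):
    """Combine p(a | r) over r using weights that sum to 1 but may be negative.

    Assumption (not from the source): negative entries of the combination
    are clipped to zero and the result is renormalized to get a valid policy.
    """
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must sum to 1")
    # Each row of cond sums to 1 and the weights sum to 1, so the mixture
    # also sums to 1, although individual entries can be negative.
    mixed = weights @ cond
    clipped = np.clip(mixed, 0.0, None)
    # Clipping only removes negative mass, so the total stays at least 1.
    return clipped / clipped.sum()


# Illustrative counts: row 0 is reward 0, row 1 is reward 1; three arms.
counts = np.array([[30, 10, 5],
                   [5, 10, 30]])
cond = conditional_policy(counts)

# Conditioning only on the high reward: weight 1 on r = 1.
plain = generalized_marginal_policy(cond, [0.0, 1.0])

# A normalized weight function with a negative weight on the low reward.
sharpened = generalized_marginal_policy(cond, [-0.5, 1.5])

print("plain:    ", plain)
print("sharpened:", sharpened)
```

In this toy setting the negative weight on the low-reward row pushes probability away from the arm most associated with reward 0 and toward the arm most associated with reward 1, which is one way to read the motivation stated in the abstract.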

Taxonomy

Topics: Advanced Bandit Algorithms Research · Forecasting Techniques and Applications · Supply Chain and Inventory Management