DOPL: Direct Online Preference Learning for Restless Bandits with Preference Feedback
Guojun Xiong, Ujwal Dinesha, Debajoy Mukherjee, Jian Li, Srinivas Shakkottai

TL;DR
This paper introduces DOPL, an online algorithm for restless bandit problems using preference feedback instead of scalar rewards, achieving sublinear regret and demonstrating effectiveness through experiments.
Contribution
The paper proposes the first algorithm for RMAB with preference feedback, enabling direct online learning and decision-making without explicit reward functions.
Findings
DOPL achieves $\tilde{O}(\sqrt{T \ln T})$ regret.
DOPL efficiently explores and adapts to unknown environments.
Experimental results confirm DOPL's effectiveness.
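To make "sublinear regret" concrete, the following minimal sketch evaluates a bound of order $\sqrt{T \ln T}$ and divides it by the horizon $T$. The function names are hypothetical. Constants and any extra logarithmic factors hidden by the $\tilde{O}$ notation are ignored, so the numbers illustrate only the growth rate, not the actual regret of DOPL.

```python
import math


def regret_bound(horizon: int) -> float:
    """Growth rate sqrt(T ln T), with all constants set to 1 (an assumption)."""
    return math.sqrt(horizon * math.log(horizon))


def average_regret(horizon: int) -> float:
    """Per-step regret implied by the bound; it shrinks as the horizon grows."""
    return regret_bound(horizon) / horizon


if __name__ == "__main__":
    for horizon in (10**2, 10**4, 10**6):
        print(horizon, round(average_regret(horizon), 4))
```

Because the bound grows more slowly than $T$, the per-step regret tends to zero as the horizon increases, which is what sublinear regret means.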
Abstract
Restless multi-armed bandits (RMAB) have been widely used to model constrained sequential decision-making problems, where the state of each restless arm evolves according to a Markov chain and each state transition generates a scalar reward. However, the success of RMAB crucially relies on the availability and quality of reward signals. Unfortunately, specifying an exact reward function in practice can be challenging and even infeasible. In this paper, we introduce Pref-RMAB, a new RMAB model in the presence of preference signals, where the decision maker only observes pairwise preference feedback rather than scalar reward from the activated arms at each decision epoch. Preference feedback, however, arguably contains less information than the scalar reward, which makes Pref-RMAB seemingly more difficult. To address this challenge, we present a direct online preference learning…
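The difference between scalar rewards and pairwise preference feedback can be sketched in a few lines. The example below assumes a Bradley-Terry preference model, a common choice in preference learning that this summary does not confirm as the paper's model. All function names and reward values are hypothetical.

```python
import math
import random


def preference_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability (assumed model) that a is preferred over b."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def sample_preference(reward_a: float, reward_b: float, rng: random.Random) -> bool:
    """One binary comparison: True if a is preferred. No scalar reward is revealed."""
    return rng.random() < preference_prob(reward_a, reward_b)


def estimate_preference(reward_a: float, reward_b: float, n: int, seed: int = 0) -> float:
    """Empirical frequency with which a beats b over n noisy comparisons."""
    rng = random.Random(seed)
    wins = sum(sample_preference(reward_a, reward_b, rng) for _ in range(n))
    return wins / n


def reward_gap_from_pref(prob: float) -> float:
    """Invert the Bradley-Terry link: the log-odds give the reward difference."""
    return math.log(prob / (1.0 - prob))


if __name__ == "__main__":
    p_hat = estimate_preference(2.0, 0.5, n=20000)
    print("estimated preference probability:", p_hat)
    print("implied reward gap:", reward_gap_from_pref(p_hat))
```

Under this assumed model, only reward differences are identifiable from comparisons, which illustrates why preference feedback carries less information than a scalar reward.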
Peer Reviews
Decision·ICLR 2025 Poster
1. The paper successfully integrates preference feedback within the RMAB framework, a novel approach that shifts away from traditional scalar reward dependency. Moreover, the presented algorithm DOPL achieves $\tilde{O}(\sqrt{T \ln T})$ regret with theoretical analysis. 2. The relaxed LP-based direct index policy of DOPL is also given to tackle the limitations of computational intractability. 3. The writing is clean and easy to follow.
1. Estimating the whole preference matrix $F$ in the DOPL algorithm incurs a large computational cost, and the paper would benefit from a thorough discussion of DOPL's computational complexity. 2. In the experiments, existing algorithms such as MLE_WIBQL and MLE_LP fail to achieve sublinear regret. A detailed discussion of why these algorithms underperform would provide valuable insights.
1. Although RLHF has recently gained significant attention due to its applications in large language models and robotics, this work is the first to consider a preference-based reward model in the restless bandit problem, opening the door for RLHF to be applied much more broadly. 2. By establishing a connection between pairwise preferences and reward values, the authors transform the reward value of each arm and state into a measure based on the preference probability between this state and a re
1. A question remains as to whether a preference-based model can outperform direct reward estimation, and whether a preference-based model is really needed in the RMAB problem. In the first two examples presented in the paper, APP MARKETING and CPAP TREATMENT, while reward data is challenging to estimate accurately and may contain substantial noise, it can still be estimated in some form. Conversely, since preference data inherently provides less information, it is unclear whether incorporating pre
Taxonomy
Topics: Advanced Bandit Algorithms Research · Smart Grid Energy Management · Data Stream Mining Techniques
