Provably Efficient Multi-Objective Bandit Algorithms under Preference-Centric Customization
Linfeng Cao, Ming Shi, Ness B. Shroff

TL;DR
This paper introduces a novel framework for multi-objective bandit algorithms that incorporate explicit user preferences, shifting focus from Pareto optimality to preference-aligned optimization, with theoretical guarantees and strong empirical results.
Contribution
First theoretical study of preference-aware multi-objective bandit algorithms that adapt to explicit user preferences, with new analytical techniques and practical algorithms.
Findings
Algorithms achieve near-optimal regret bounds.
Strong empirical performance demonstrates effectiveness.
Addresses unknown and hidden preference scenarios.
Abstract
Multi-objective multi-armed bandit (MO-MAB) problems traditionally aim to achieve Pareto optimality. However, real-world scenarios often involve users with varying preferences across objectives, resulting in a Pareto-optimal arm that may score high for one user but perform quite poorly for another. This highlights the need for customized learning, a factor often overlooked in prior research. To address this, we study a preference-aware MO-MAB framework in the presence of explicit user preference. It shifts the focus from achieving Pareto optimality to further optimizing within the Pareto front under preference-centric customization. To our knowledge, this is the first theoretical study of customized MO-MAB optimization with explicit user preferences. Motivated by practical applications, we explore two scenarios: unknown preference and hidden preference, each presenting unique challenges…
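To make the gap between Pareto optimality and preference-aligned optimality concrete, here is a minimal sketch. It assumes that a user's overall reward is the inner product of a preference vector with the arm's mean reward vector; the arm means, the two preference vectors, and the helper names `pareto_front` and `preferred_arm` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def pareto_front(means):
    """Return indices of arms that no other arm dominates."""
    n = len(means)
    front = []
    for i in range(n):
        dominated = any(
            np.all(means[j] >= means[i]) and np.any(means[j] > means[i])
            for j in range(n) if j != i
        )
        if not dominated:
            front.append(i)
    return front

def preferred_arm(means, preference):
    """Arm with the largest preference-weighted mean reward (assumed linear)."""
    return int(np.argmax(means @ preference))

# Hypothetical mean rewards: 4 arms, 2 objectives.
means = np.array([
    [0.9, 0.1],
    [0.1, 0.9],
    [0.5, 0.5],
    [0.3, 0.3],  # dominated by arm 2
])

front = pareto_front(means)

# Two hypothetical users with opposite priorities.
user_a = np.array([0.9, 0.1])
user_b = np.array([0.1, 0.9])

best_a = preferred_arm(means, user_a)
best_b = preferred_arm(means, user_b)
# How well user B does if served user A's preferred (Pareto-optimal) arm.
score_b_on_best_a = float(means[best_a] @ user_b)
```

Under these assumed numbers, arms 0, 1, and 2 are all Pareto-optimal, yet the arm that is best for user A gives user B a weighted reward of only 0.18, compared with 0.82 for user B's own preferred arm.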
Peer Reviews
Decision · Submitted to ICLR 2025
* The paper proposes a new Preference-Aware MO-MAB framework and provides very illustrative examples to help the reader understand it.
* Multiple different settings for the preference feedback are considered. The paper designs a tailored algorithm for each setting and provides comprehensive theoretical analyses.
* Sublinear regret is guaranteed under all considered settings.
While this paper appears interesting and promising, several concerns need to be raised, including issues related to the presentation, the novelty of the algorithms, and the solidity of the theoretical results. Please refer to **Questions** below.
1. The authors propose a natural preference-aware MO-MAB problem and conduct a systematic study with several different settings, considering both unknown rewards and/or preferences as well as the corruption and non-stationary settings.
2. The presentation is very clear and easy to follow. Almost all settings are illustrated with intuitive examples to motivate each setting.
Although the authors give a comprehensive study over a wide range of settings, each setting can be easily reformulated as, or mapped to, an existing MAB setting, where leveraging existing approaches such as UCB/LinUCB with slight modification yields the results of the current paper. The core is to use the scalarization function to weight each objective, and the scalarization function is modeled as an (unknown) preference vector. As such, the current study seems to be a wide combinatorial of separate piec
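The reduction this review describes, scalarizing the vector reward with a preference vector and running a UCB rule on the result, can be sketched as follows. This is a UCB1-style illustration under strong simplifying assumptions: the preference vector is known, fixed, and lies on the simplex, and rewards are Bernoulli per objective. It is not the paper's algorithm, which targets unknown and hidden preferences; the function name `preference_ucb` and all numbers are hypothetical.

```python
import numpy as np

def preference_ucb(means, preference, horizon, seed=0):
    """UCB1 on the scalarized reward preference @ reward_vector.

    means:      (n_arms, n_obj) true Bernoulli means, used only to simulate.
    preference: (n_obj,) known weights, assumed non-negative and summing to 1
                so the scalarized reward stays in [0, 1].
    Returns the number of times each arm was pulled.
    """
    rng = np.random.default_rng(seed)
    n_arms, n_obj = means.shape
    counts = np.zeros(n_arms)
    sums = np.zeros((n_arms, n_obj))  # per-objective reward totals
    for t in range(horizon):
        if t < n_arms:
            arm = t  # pull each arm once to initialize
        else:
            est = sums / counts[:, None]  # empirical mean vectors
            bonus = np.sqrt(2.0 * np.log(t + 1) / counts)
            arm = int(np.argmax(est @ preference + bonus))
        reward = (rng.random(n_obj) < means[arm]).astype(float)
        counts[arm] += 1
        sums[arm] += reward
    return counts

# Hypothetical instance: 4 arms, 2 objectives, user favors objective 0.
means = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5], [0.3, 0.3]])
preference = np.array([0.9, 0.1])
counts = preference_ucb(means, preference, horizon=5000)
```

Keeping per-objective estimates, rather than only the scalarized mean, is a deliberate choice in this sketch: it lets the same statistics be rescored if the preference vector changes.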
The work is well motivated by the limitations of the Pareto bandit framework. The authors make a good case for their augmentations of the MO-MAB problem and provide several real-world examples to motivate their proposed structure. The presentation of the variants of this structure is concise, well articulated, and backed by good visual representations. A principled algorithm is presented to solve the problem for each of the cases. Further, these UCB-style algorithms are analyzed and order wise
1. Some experiments should be included in the narrative of the main paper, along with a discussion that contextualizes the baselines. Please consider adding to the main paper an experiment comparing the prior MAB approaches listed in the literature (S-UCB, S-MOSS, Pareto-UCB, Pareto-TS). What I would be most interested to see is how and why these approaches, which are agnostic to user preferences, are configured for comparison. The how is already answered in Appendix A.1.1; however, the why should also
Taxonomy
Topics: Advanced Bandit Algorithms Research · Advanced Multi-Objective Optimization Algorithms · Machine Learning and Data Classification
