TL;DR
This paper introduces a framework for preference-based pure exploration in bandit problems with vector-valued rewards. It derives a sample complexity lower bound, analyzes how the geometry of the preference cone affects hardness, and proposes the asymptotically optimal PreTS algorithm.
Contribution
It derives a lower bound on sample complexity that accounts for the geometry of the preference cone, and proposes the Preference-based Track and Stop (PreTS) algorithm, which is asymptotically optimal for identifying the most preferred policy.
Findings
Derived a novel lower bound on sample complexity.
Designed the PreTS algorithm for preference-based exploration.
Proved asymptotic optimality of PreTS.
Abstract
We study the preference-based pure exploration problem for bandits with vector-valued rewards. The rewards are ordered using a (given) preference cone and our goal is to identify the set of Pareto optimal arms. First, to quantify the impact of preferences, we derive a novel lower bound on sample complexity for identifying the most preferred policy with a confidence level 1 - δ. Our lower bound elicits the role played by the geometry of the preference cone and punctuates the difference in hardness compared to existing best-arm identification variants of the problem. We further explicate this geometry when the rewards follow Gaussian distributions. We then provide a convex relaxation of the lower bound and leverage it to design the Preference-based Track and Stop (PreTS) algorithm that identifies the most preferred policy. Finally, we show that the sample complexity of…
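To make the ordering in the abstract concrete, here is a minimal sketch of how a preference cone determines the Pareto optimal arms when the mean reward vectors are already known. It is not the PreTS algorithm and involves no sampling. It assumes a polyhedral cone written as {x : W x ≥ 0}, and it assumes that arm j dominates arm i when the difference of their means (mean of j minus mean of i) is a nonzero element of the cone. The function names, the matrix W, and the example means are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def in_cone(x: Vector, W: Matrix, tol: float = 1e-12) -> bool:
    """Return True if x lies in the polyhedral cone {x : W x >= 0} (assumed form)."""
    return all(sum(w * v for w, v in zip(row, x)) >= -tol for row in W)


def pareto_optimal_arms(means: Sequence[Vector], W: Matrix, tol: float = 1e-12) -> List[int]:
    """Indices of arms that no other arm dominates under the cone ordering.

    Assumed convention: arm j dominates arm i when means[j] - means[i]
    is a nonzero vector inside the cone.
    """
    optimal = []
    for i, mu_i in enumerate(means):
        dominated = False
        for j, mu_j in enumerate(means):
            if i == j:
                continue
            diff = [b - a for a, b in zip(mu_i, mu_j)]
            is_nonzero = any(abs(d) > tol for d in diff)
            if is_nonzero and in_cone(diff, W, tol):
                dominated = True
                break
        if not dominated:
            optimal.append(i)
    return optimal


# Hypothetical mean reward vectors for four arms with two objectives.
means = [[1.0, 0.0], [0.0, 1.0], [0.4, 0.4], [0.8, 0.8]]

# Positive orthant: the usual componentwise Pareto order.
orthant = [[1.0, 0.0], [0.0, 1.0]]

# A wider pointed cone: more pairs of arms become comparable.
wider = [[2.0, 1.0], [1.0, 2.0]]

print(pareto_optimal_arms(means, orthant))
print(pareto_optimal_arms(means, wider))
```

With the positive orthant, three of the four arms are Pareto optimal. Under the wider cone only the last arm remains, which shows how the shape of the cone changes the set to be identified.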
Taxonomy
Methods: Sparse Evolutionary Training
