Preference-based Reinforcement Learning beyond Pairwise Comparisons: Benefits of Multiple Options
Joongkyu Lee, Seouh-won Yi, Min-hwan Oh

TL;DR
This paper introduces a preference-based reinforcement learning algorithm that uses ranking feedback over multiple options and shows, with theoretical guarantees, that sample efficiency improves as the subset size increases.
Contribution
We propose M-AUPO, an algorithm that learns from ranking feedback over action subsets. Its theoretical analysis shows that the guarantee improves with larger subsets, unlike earlier multiple-comparison results whose guarantees fail to improve as the feedback length grows.
Findings
Larger subsets improve sample efficiency in PbRL.
M-AUPO achieves a suboptimality gap that decreases with subset size.
Theoretical bounds show benefits of multiple options in ranking feedback.
Abstract
We study online preference-based reinforcement learning (PbRL) with the goal of improving sample efficiency. While a growing body of theoretical work has emerged, motivated by PbRL's recent empirical success, particularly in aligning large language models (LLMs), most existing studies focus only on pairwise comparisons. A few recent works (Zhu et al., 2023; Mukherjee et al., 2024; Thekumparampil et al., 2024) have explored using multiple comparisons and ranking feedback, but their performance guarantees fail to improve, and can even deteriorate, as the feedback length increases, despite the richer information available. To address this gap, we adopt the Plackett-Luce (PL) model for ranking feedback over action subsets and propose M-AUPO, an algorithm that selects multiple actions by maximizing the average uncertainty within the offered subset. We prove that M-AUPO achieves a suboptimality…
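To make the feedback model concrete, here is a minimal sketch of the standard Plackett-Luce ranking probability: the top item is drawn by a softmax over the whole subset, the next by a softmax over what remains, and so on. This is not the paper's implementation; the function name pl_ranking_prob and the example scores are hypothetical and chosen only for illustration.

```python
import itertools
import math


def pl_ranking_prob(scores, ranking):
    """Probability of `ranking` (best item first) under a Plackett-Luce model.

    `scores` maps each item to a real-valued utility (hypothetical values here).
    """
    remaining = list(ranking)
    prob = 1.0
    for item in ranking:
        # Softmax over the items not yet placed; subtract the max for stability.
        m = max(scores[i] for i in remaining)
        denom = sum(math.exp(scores[i] - m) for i in remaining)
        prob *= math.exp(scores[item] - m) / denom
        remaining.remove(item)
    return prob


# Hypothetical utilities for a subset of three actions.
scores = {"a": 1.0, "b": 0.5, "c": -0.2}

# The probabilities of all 3! rankings sum to one.
total = sum(pl_ranking_prob(scores, r) for r in itertools.permutations(scores))

# With two items, PL reduces to the pairwise (Bradley-Terry) comparison model.
pairwise = pl_ranking_prob({"a": 1.0, "b": 0.0}, ("a", "b"))
```

A ranking over K items factors into K - 1 nontrivial choice events, which is the sense in which ranking feedback over larger subsets carries more information than a single pairwise comparison.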
Taxonomy
Topics: Reinforcement Learning in Robotics · Mobile Crowdsensing and Crowdsourcing · Multimodal Machine Learning Applications
