Efficient Preference-Based Reinforcement Learning: Randomized Exploration Meets Experimental Design
Andreas Schlaginhaufen, Reda Ouhamma, Maryam Kamgarpour

TL;DR
This paper introduces a new randomized exploration algorithm for preference-based reinforcement learning that improves query efficiency and supports parallel feedback collection, with theoretical guarantees and competitive empirical performance.
Contribution
It proposes a tractable meta-algorithm combining randomized exploration with experimental design, providing regret guarantees and enhanced query efficiency in preference-based RL.
Findings
The method achieves performance competitive with reward-based RL.
It requires fewer preference queries than existing approaches.
Parallelization of queries is feasible and effective.
Abstract
We study reinforcement learning from human feedback in general Markov decision processes, where agents learn from trajectory-level preference comparisons. A central challenge in this setting is to design algorithms that select informative preference queries to identify the underlying reward while ensuring theoretical guarantees. We propose a meta-algorithm based on randomized exploration, which avoids the computational challenges associated with optimistic approaches and remains tractable. We establish both regret and last-iterate guarantees under mild reinforcement learning oracle assumptions. To improve query complexity, we introduce and analyze an improved algorithm that collects batches of trajectory pairs and applies optimal experimental design to select informative comparison queries. The batch structure also enables parallelization of preference queries, which is relevant in…
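The query-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes a reward that is linear in trajectory features, so each candidate comparison is summarized by the feature difference of its two trajectories, and it swaps in a simple greedy D-optimal heuristic that repeatedly picks the pair giving the largest increase in the determinant of a regularized design matrix. The function names, the `reg` parameter, and the toy features are hypothetical.

```python
def mat_vec(A, x):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def greedy_d_optimal(pairs, num_queries, reg=1.0):
    """Greedily pick comparison queries from a batch of trajectory pairs.

    pairs: list of (phi_a, phi_b), the feature vectors of the two
           trajectories in each candidate comparison (assumed given).
    Returns the indices of the selected pairs, in selection order.
    """
    if not pairs:
        return []
    d = len(pairs[0][0])
    # Inverse of the regularized design matrix V = reg * I.
    V_inv = [[(1.0 / reg) if i == j else 0.0 for j in range(d)]
             for i in range(d)]
    # Under the assumed linear reward, a comparison informs the
    # parameter only through the feature difference of the pair.
    diffs = [[a - b for a, b in zip(pa, pb)] for pa, pb in pairs]

    def gain(i, V_inv):
        # det(V + x x^T) = det(V) * (1 + x^T V^{-1} x), so maximizing
        # x^T V^{-1} x maximizes the determinant after adding x.
        x = diffs[i]
        return sum(xi * ui for xi, ui in zip(x, mat_vec(V_inv, x)))

    chosen = []
    remaining = set(range(len(pairs)))
    for _ in range(min(num_queries, len(pairs))):
        # Ties are broken in favour of the smallest index.
        best = max(sorted(remaining), key=lambda i: gain(i, V_inv))
        chosen.append(best)
        remaining.remove(best)
        # Sherman-Morrison rank-one update of the inverse.
        x = diffs[best]
        u = mat_vec(V_inv, x)
        denom = 1.0 + sum(xi * ui for xi, ui in zip(x, u))
        V_inv = [[V_inv[i][j] - u[i] * u[j] / denom for j in range(d)]
                 for i in range(d)]
    return chosen


# Toy batch: pairs 0 and 1 differ along the same direction,
# pair 2 differs along another, and pair 3 is uninformative.
toy_pairs = [
    ([1.0, 0.0], [0.0, 0.0]),
    ([1.0, 0.0], [0.1, 0.0]),
    ([0.0, 1.0], [0.0, 0.2]),
    ([0.0, 0.0], [0.0, 0.0]),
]
selected = greedy_d_optimal(toy_pairs, num_queries=2)
```

On this toy batch the sketch selects pairs 0 and 2: after pair 0 is chosen, pair 1 adds little because it points in a direction that is already covered. Because the selection depends only on the features and not on the preference labels, the chosen queries could all be sent to annotators at once, which is the kind of parallelism the abstract refers to.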
Peer Reviews
Decision: NeurIPS 2025 poster
## Strengths **Idea and Motivation**: The paper explores a compelling and timely idea, particularly relevant given the increasing focus on preference-based learning. One of the major bottlenecks in this field is the annotator burden and latency associated with collecting human preferences. The proposed approach directly addresses this challenge, making the contribution both relevant and potentially impactful. **Writing and Presentation**: The writing and structure of the paper are strong.
I found the clarity of this paper to be great. It is easy to follow, and all algorithms and theoretical results are well explained. Careful comparisons to existing work are also provided, and I appreciate the time the authors spent on writing this. The paper proposes two algorithms with theoretical guarantees. Although I didn't check the proofs in detail, the results look sound. One weakness is that the paper could benefit from a clearer “highlight” of its main novelty. From my understan
Strengths: 1. The paper provides a clear framework for RLHF with strong theoretical guarantees (regret and PAC bounds). 2. The LRPO-OD variant introduces practical variations (lazy updates and D-optimal design) to enable parallelization, directly addressing a key bottleneck in real-world RLHF. Weaknesses: 1. Empirical validation is limited to a single, simple simulated environment, making it difficult to assess real-world performance and scalability. 2. The framework is confined to a linear rew
### Strengths The preference-based learning setting is reasonable. The structure of this paper is clear and easy to follow. Numerical experiments are provided. ### Weaknesses 1. Although there are only a few previous works studying the preference-based setting, I did not find much technical contribution in this paper. It seems to me that most of the results are standard and unsurprising. 2. Another part that confuses me is how important efficiency in learning the transition function is.
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Age of Information Optimization
