Efficient Preference-Based Reinforcement Learning: Randomized Exploration Meets Experimental Design

Andreas Schlaginhaufen; Reda Ouhamma; Maryam Kamgarpour

arXiv:2506.09508 · cs.LG · December 5, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces a new randomized exploration algorithm for preference-based reinforcement learning that improves query efficiency and supports parallel feedback collection, with theoretical guarantees and competitive empirical performance.

Contribution

It proposes a tractable meta-algorithm combining randomized exploration with experimental design, providing regret guarantees and enhanced query efficiency in preference-based RL.
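As a rough illustration of what randomized exploration can look like, the sketch below samples a perturbed reward parameter around a current estimate and then acts greedily with respect to the sample. It is not the paper's algorithm. The linear reward model, the Gaussian perturbation, the finite set of candidate policies standing in for an RL oracle, and all names (`sample_reward_parameter`, `greedy_policy`, `scale`) are assumptions made for this example.

```python
import numpy as np

def sample_reward_parameter(theta_hat, V, scale, rng):
    """Draw a perturbed reward parameter around the estimate theta_hat.

    Assumption: Gaussian perturbation with covariance scale^2 * V^{-1},
    where V is a (hypothetical) regularized design matrix of past queries.
    """
    cov = scale ** 2 * np.linalg.inv(V)
    return rng.multivariate_normal(theta_hat, cov)

def greedy_policy(theta, policy_features):
    """Return the index of the candidate policy with the highest linear reward.

    policy_features: (m, d) array of expected feature vectors, one row per
    candidate policy (a stand-in for a planning oracle).
    """
    return int(np.argmax(policy_features @ theta))

rng = np.random.default_rng(0)
theta_hat = np.array([0.5, -0.2])
V = np.eye(2)
theta_tilde = sample_reward_parameter(theta_hat, V, scale=0.1, rng=rng)
policy_index = greedy_policy(theta_tilde, np.array([[0.0, 1.0], [1.0, 0.0]]))
```

Sampling replaces the maximization over a confidence set that optimistic methods require, which is the computational difficulty the abstract refers to.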

Findings

01. The method achieves performance competitive with reward-based RL.

02. It requires fewer preference queries than existing approaches.

03. Parallelization of queries is feasible and effective.

Abstract

We study reinforcement learning from human feedback in general Markov decision processes, where agents learn from trajectory-level preference comparisons. A central challenge in this setting is to design algorithms that select informative preference queries to identify the underlying reward while ensuring theoretical guarantees. We propose a meta-algorithm based on randomized exploration, which avoids the computational challenges associated with optimistic approaches and remains tractable. We establish both regret and last-iterate guarantees under mild reinforcement learning oracle assumptions. To improve query complexity, we introduce and analyze an improved algorithm that collects batches of trajectory pairs and applies optimal experimental design to select informative comparison queries. The batch structure also enables parallelization of preference queries, which is relevant in…
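To make the query-selection idea concrete, here is a minimal sketch of a greedy D-optimal design heuristic over a batch of candidate trajectory pairs. It is not taken from the paper. It assumes a linear reward model in which each pair is summarized by a feature difference vector, and the function name `greedy_d_optimal` and the regularizer `reg` are hypothetical.

```python
import numpy as np

def greedy_d_optimal(diffs, k, reg=1.0):
    """Greedily pick k rows of `diffs` that increase log det of the design matrix.

    diffs: (n, d) float array; row i is an assumed feature difference
           phi(tau_i) - phi(tau_i') for candidate trajectory pair i.
    Returns the list of selected row indices in selection order.
    """
    n, d = diffs.shape
    V_inv = np.eye(d) / reg              # inverse of the regularized design matrix
    available = np.ones(n, dtype=bool)
    chosen = []
    for _ in range(min(k, n)):
        # Matrix determinant lemma: adding x scales det(V) by 1 + x^T V^{-1} x,
        # so the pair with the largest quadratic form is the most informative.
        gains = np.einsum("ij,jk,ik->i", diffs, V_inv, diffs)
        gains[~available] = -np.inf
        i = int(np.argmax(gains))
        chosen.append(i)
        available[i] = False
        # Sherman-Morrison rank-one update of the inverse.
        x = diffs[i]
        Vx = V_inv @ x
        V_inv -= np.outer(Vx, Vx) / (1.0 + x @ Vx)
    return chosen

candidates = np.array([[1.0, 0.0], [0.9, 0.0], [0.0, 1.0], [0.1, 0.1]])
selected = greedy_d_optimal(candidates, k=2)
```

In this toy batch the heuristic skips the near-duplicate second pair and selects pairs covering both feature directions. Because the whole batch is chosen before any label is observed, the selected comparisons could be sent to annotators in parallel.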

Peer Reviews

Decision · NeurIPS 2025 poster

Reviewer 01 · Rating 5 · Confidence 3

Strengths

**Idea and Motivation**: The paper explores a compelling and timely idea, particularly relevant given the increasing focus on preference-based learning. One of the major bottlenecks in this field is the annotator burden and latency associated with collecting human preferences. The proposed approach directly addresses this challenge, making the contribution both relevant and potentially impactful.

**Writing and Presentation**: The writing and structure of the paper are strong.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

I found the clarity of this paper to be great. It is easy to follow, and all algorithms and theoretical results are well explained. Careful comparisons to existing work are also provided, and I appreciate the time the authors spent on writing this. The paper proposes two algorithms with theoretical guarantees. Although I didn't check the proofs in detail, the results look sound.

One weakness is that the paper could benefit from a clearer “highlight” of its main novelty. From my understan

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper provides a clear framework for RLHF with strong theoretical guarantees (regret and PAC bounds).
2. The LRPO-OD variant introduces practical variations (lazy updates and D-optimal design) to enable parallelization, directly addressing a key bottleneck in real-world RLHF.

Weaknesses:

1. Empirical validation is limited to a single, simple simulated environment, making it difficult to assess real-world performance and scalability.
2. The framework is confined to a linear rew

Reviewer 04 · Rating 4 · Confidence 3

Strengths

The preference-based learning setting is reasonable. The structure of this paper is clear and easy to follow. Numerical experiments are provided.

### Weaknesses

1. Although only a few previous works study the preference-based setting, I did not find much technical contribution in this paper. It seems to me that most of the results are standard and unsurprising.
2. Another part that confuses me is how important efficiency in learning the transition function is.

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Age of Information Optimization