LEASE: Offline Preference-based Reinforcement Learning with High Sample Efficiency

Xiao-Yin Liu; Guotao Li; Xiao-Hu Zhou; Zeng-Guang Hou

arXiv:2412.21001 · cs.LG · February 10, 2026

TL;DR

LEASE introduces a sample-efficient offline preference-based reinforcement learning algorithm that uses a learned transition model and an uncertainty-aware mechanism to reduce human labeling effort while maintaining high performance.

Contribution

The paper proposes LEASE, a novel offline PbRL algorithm that leverages a transition model and confidence-based data selection to improve sample efficiency and reward accuracy.

Findings

01. Achieves comparable performance with fewer preference labels.

02. Uses a transition model to generate unlabeled preference data.

03. Provides theoretical guarantees on policy improvement.
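
The second finding, generating unlabeled preference data with a transition model, can be illustrated with a minimal sketch. It assumes a linear least-squares fit as a stand-in for the learned transition model and a hypothetical fixed policy; the names `fit_linear_model`, `rollout`, and `generate_unlabeled_pairs` are illustrative and are not taken from the LEASE repository.

```python
import numpy as np

def fit_linear_model(states, actions, next_states):
    """Least-squares fit of s' ~ [s, a, 1] @ W (assumed stand-in for a learned model)."""
    X = np.hstack([states, actions, np.ones((len(states), 1))])
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    return W

def rollout(W, start_state, policy, horizon):
    """Roll the fitted model forward; returns a (horizon, state_dim + action_dim) segment."""
    s, steps = start_state, []
    for _ in range(horizon):
        a = policy(s)
        steps.append(np.concatenate([s, a]))
        s = np.concatenate([s, a, [1.0]]) @ W  # predicted next state
    return np.stack(steps)

def generate_unlabeled_pairs(W, start_states, policy, horizon, n_pairs, rng):
    """Sample two distinct start states per pair and roll out one segment from each."""
    pairs = []
    for _ in range(n_pairs):
        i, j = rng.choice(len(start_states), size=2, replace=False)
        pairs.append((rollout(W, start_states[i], policy, horizon),
                      rollout(W, start_states[j], policy, horizon)))
    return pairs

# Toy offline dataset with known linear dynamics (hypothetical, for illustration only).
rng = np.random.default_rng(0)
A = rng.normal(scale=0.3, size=(3, 3))
B = rng.normal(scale=0.3, size=(2, 3))
states = rng.normal(size=(200, 3))
actions = rng.normal(size=(200, 2))
next_states = states @ A + actions @ B

W = fit_linear_model(states, actions, next_states)
policy = lambda s: -0.1 * s[:2]  # hypothetical policy used only to drive rollouts
pairs = generate_unlabeled_pairs(W, states, policy, horizon=5, n_pairs=4, rng=rng)
```

Each element of `pairs` holds two synthetic segments with no preference label attached; in a pipeline of this kind, a pretrained reward model would then be asked to label them.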

Abstract

Offline preference-based reinforcement learning (PbRL) provides an effective way to overcome the difficulty of designing rewards and the high cost of online interaction. However, since labeling preferences requires real-time human feedback, acquiring sufficient preference labels is challenging. To solve this, this paper proposes an offLine prEference-bAsed RL with high Sample Efficiency (LEASE) algorithm, where a learned transition model is leveraged to generate unlabeled preference data. Considering that the pretrained reward model may generate incorrect labels for unlabeled data, we design an uncertainty-aware mechanism to ensure the performance of the reward model, where only high-confidence, low-variance data are selected. Moreover, we provide a generalization bound for the reward model to analyze the factors influencing reward accuracy, and demonstrate that the policy learned by LEASE has…
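
The uncertainty-aware selection described in the abstract can be sketched as follows. This is one plausible reading, not the authors' implementation: it assumes an ensemble of reward models, a Bradley-Terry preference probability over summed segment rewards, and hypothetical thresholds `conf_threshold` and `var_threshold`. The label convention (0 means the first segment is preferred) is also an assumption.

```python
import numpy as np

def preference_probs(reward_fns, seg0, seg1):
    """Bradley-Terry probability that seg0 is preferred, one value per ensemble member."""
    diffs = np.array([f(seg0).sum() - f(seg1).sum() for f in reward_fns])
    return 1.0 / (1.0 + np.exp(-diffs))

def select_pseudo_labels(reward_fns, pairs, conf_threshold=0.9, var_threshold=0.01):
    """Keep only pairs whose pseudo-label is confident AND agreed on by the ensemble."""
    selected = []
    for seg0, seg1 in pairs:
        p = preference_probs(reward_fns, seg0, seg1)
        p_mean, p_var = p.mean(), p.var()
        confidence = max(p_mean, 1.0 - p_mean)
        if confidence >= conf_threshold and p_var <= var_threshold:
            label = 0 if p_mean >= 0.5 else 1  # index of the preferred segment
            selected.append((seg0, seg1, label))
    return selected

# Hypothetical ensemble of three linear reward models over 2-D features.
weights = [np.array([1.0, 1.0]), np.array([1.0, 1.0]), np.array([1.0, 0.0])]
reward_fns = [lambda seg, w=w: seg @ w for w in weights]

good = np.full((3, 2), 2.0)        # clearly better than `bad` under every model
bad = np.zeros((3, 2))
disputed = np.array([[1.1, 5.9]])  # the models disagree strongly on this segment
pairs = [(good, bad), (bad, good), (good, good), (disputed, np.zeros((1, 2)))]

selected = select_pseudo_labels(reward_fns, pairs)
labels = [label for _, _, label in selected]
```

In this toy setup the identical pair is dropped for low confidence, and the disputed pair is dropped for high ensemble variance even though its mean probability is confident, which is the case the variance criterion is meant to catch.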

Peer Reviews

Decision · ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 5 · Confidence 4

Strengths

- The proposed LEASE algorithm is novel: it introduces a new way to handle the challenge of limited preference data with a learned transition model. The selection mechanism for unlabeled data based on confidence and uncertainty is a thoughtful contribution to improving the stability and accuracy of the reward model.
- Theoretical Framework: The paper attempts to provide a theoretical foundation for the algorithm, which is a step towards more principled offline PbRL methods.

Weaknesses

- While the paper provides a theoretical analysis of the algorithm, the analysis is not rigorous enough. For example, the approximation error of the reward function usually depends on a condition number that is exponential in $R_{\text{max}}$ when learned from preference data [1], but this term seems to be missing from the derived bound. The reward approximation error does not directly translate into an additional error term in the performance bound and requires careful treatment (e.g., use a pessimistic rewa

Reviewer 02 · Rating 5 · Confidence 3

Strengths

1. The paper is well-organized, with the methodology and theoretical contributions presented clearly, making the core ideas easy to understand.
2. Improving sample efficiency in offline PbRL is an important and challenging problem, with significant implications for many real-world applications.
3. The paper provides contributions both in empirical algorithm development and theoretical analysis. Although there are concerns regarding the assumptions, the theoretical contributions add valuable insi

Weaknesses

1. The proposed LEASE algorithm closely follows the pipeline of Surf[1], with only two major differences: (1) Surf augments data using random temporal cropping, whereas LEASE generates synthetic data with a learned transition model; (2) Surf only uses confidence for label filtering, whereas LEASE employs both confidence and uncertainty principles. Despite these differences, a direct comparison with Surf is missing, which would help clarify the effects of the introduced transition model and the u

Reviewer 03 · Rating 5 · Confidence 3

Strengths

1. The problem of preference efficiency is vital and fundamental in offline meta-RL.
2. The proposed method is sound and sensible.
3. Experiment results are convincing.

Weaknesses

1. Lack of analysis on the effect of transition model accuracy. The learned transition model can be inaccurate, and accumulate errors during rollout. Will this do a lot of damage to algorithm performance?
2. Lack of analysis of baseline algorithms' performance with different numbers of preference data. How much data is needed for baseline algorithms to achieve comparable performance to LEASE?
3. I would like to see results of more baseline algorithms, e.g., previous model-based offline RL algori

Code & Models

Repositories

xiaoyinliu0714/LEASE
PyTorch · Official

Taxonomy

Topics · Reinforcement Learning in Robotics