Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning
Calarina Muslimani, Matthew E. Taylor

TL;DR
This paper introduces Sub-optimal Data Pre-training (SDP), a method that uses reward-free, low-quality data to pre-train reward models, significantly reducing human interaction needs in human-in-the-loop reinforcement learning.
Contribution
The paper proposes SDP, a novel pre-training approach leveraging sub-optimal data to enhance reward model learning and improve feedback efficiency in human-in-the-loop RL.
Findings
SDP improves reward model training without human labels.
SDP enhances RL performance across robotic tasks.
SDP reduces human interaction requirements.
Abstract
To create useful reinforcement learning (RL) agents, step zero is to design a suitable reward function that captures the nuances of the task. However, reward engineering can be a difficult and time-consuming process. Instead, human-in-the-loop RL methods hold the promise of learning reward functions from human feedback. Despite recent successes, many of the human-in-the-loop RL methods still require numerous human interactions to learn successful reward functions. To improve the feedback efficiency of human-in-the-loop RL methods (i.e., require less human interaction), this paper introduces Sub-optimal Data Pre-training, SDP, an approach that leverages reward-free, sub-optimal data to improve scalar- and preference-based RL algorithms. In SDP, we start by pseudo-labeling all low-quality data with the minimum environment reward. Through this process, we obtain reward labels to pre-train…
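The pseudo-labeling step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the linear reward model, the feature vectors, the value of `R_MIN`, and the function names `pseudo_label`, `pretrain_linear_reward`, and `predict` are all assumptions made for the example. It shows only the idea of assigning the minimum environment reward to every reward-free, sub-optimal transition and fitting a reward model to those labels before any human feedback is collected.

```python
import random

def pseudo_label(transitions, r_min):
    """Attach the minimum environment reward to every reward-free transition."""
    return [(features, r_min) for features in transitions]

def predict(w, b, x):
    """Hypothetical linear reward model: r_hat(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def pretrain_linear_reward(dataset, lr=0.05, epochs=200):
    """Fit the linear model to the pseudo-labels with per-sample SGD on squared error."""
    dim = len(dataset[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, target in dataset:
            err = predict(w, b, x) - target
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

# Assumed toy data: 50 sub-optimal transitions, each a 3-dimensional feature vector.
rng = random.Random(0)
sub_optimal = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(50)]

R_MIN = -1.0  # assumed minimum environment reward for this toy task
labeled = pseudo_label(sub_optimal, R_MIN)
w, b = pretrain_linear_reward(labeled)
```

After this pre-training, the model predicts roughly `R_MIN` on the sub-optimal data, so later human feedback would only need to teach it where rewards are higher. A real reward model would typically be a neural network trained on state-action pairs rather than this linear stand-in.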
Peer Reviews
Decision: ICLR 2025 Poster
- The main idea of this submission is straightforward and intuitive. Utilizing sub-optimal data is a popular way to improve sample efficiency.
- The submission is well-written and easy to follow. Visualizations are clear and helpful.
- The demonstrated experiments are reasonably comprehensive, including robotic locomotion and manipulation tasks.
- The submission includes a 16-person human subject study.
- The result analysis is well-formulated, with multiple seeds and significance tests.
- My main concern about the submission is that the contribution is incremental, without a significant advance. Leveraging a set of sub-optimal data generated with a randomly initialized policy as a warm start for the reward model is straightforward, and seems to me the only major contribution. The authors follow the standard RLHF framework, such as the Bradley-Terry model. This work failed to address any of the existing challenges, such as the assumption of linear reward feature combinations, the as
The algorithm is straightforward. The method is well explained and clear. The empirical analysis is good. Running human-in-the-loop experiments with human participants is admirable.
The assumption made in Equation 4 is quite restrictive, since it states that the per-state reward must be low for every state in a sub-optimal trajectory. There are many tasks in which a sub-optimal trajectory may receive significant non-trivial reward, such as for achieving an intermediate goal, while still being significantly sub-optimal. In addition, knowledge of the minimum environment reward can also be a limiting assumption, though less so. The method may be a little bit too straightforward,
1. The SDP approach is novel.
2. The paper has some real-world validation (a human study).
1. The performance of the SDP method is highly dependent on the quality and representativeness of the sub-optimal data used for pre-training. Limited data availability or random policies generated from the same initial seeds can negatively impact SDP performance.
2. The human study in this paper did not account for human variance. For instance, would differences in education levels affect the quality of the labels?
Taxonomy
Topics: Traffic control and management · Reinforcement Learning in Robotics · Smart Grid Energy Management
