Reducing Reward Dependence in RL Through Adaptive Confidence Discounting

Muhammed Yusuf Satici; David L. Roberts

arXiv:2502.21181·cs.LG·March 3, 2025

TL;DR

This paper introduces a reinforcement learning method that reduces reliance on human-delivered or otherwise expensive rewards. The agent requests an explicit reward only when its confidence is low and uses a reward function model as a proxy when its confidence is high.

Contribution

The authors propose a novel adaptive confidence-based reward discounting algorithm that reduces reward dependence while maintaining policy quality.

Findings

01. Achieves comparable performance with only 20% of the rewards used by the baseline.

02. Reduces reward dependence without sacrificing learning effectiveness.

03. Effective in environments with expensive or human-in-the-loop reward signals.

Abstract

In human-in-the-loop reinforcement learning or environments where calculating a reward is expensive, the costly rewards can make learning efficiency challenging to achieve. The cost of obtaining feedback from humans or calculating expensive rewards means algorithms receiving feedback at every step of long training sessions may be infeasible, which may limit agents' abilities to efficiently improve performance. Our aim is to reduce the reliance of learning agents on humans or expensive rewards, improving the efficiency of learning while maintaining the quality of the learned policy. We offer a novel reinforcement learning algorithm that requests a reward only when its knowledge of the value of actions in an environment state is low. Our approach uses a reward function model as a proxy for human-delivered or expensive rewards when confidence is high, and asks for those explicit rewards…
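
The gating idea described in the abstract (query the costly reward when confidence is low, fall back on a reward model otherwise) can be illustrated with a minimal tabular sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the five-state chain environment, the `true_reward` stand-in for a human or expensive reward, the running-mean `RewardModel`, the count-based confidence `n / (n + 1)`, and the fixed `threshold`. The paper's actual confidence measure and its adaptive discounting are not reproduced here.

```python
import random
from collections import defaultdict

ACTIONS = (0, 1)  # 0 = move left, 1 = move right
GOAL = 4          # terminal state of a hypothetical 5-state chain


def true_reward(state, action):
    """Stand-in for a human-delivered or expensive reward (assumption)."""
    return 1.0 if state == GOAL - 1 and action == 1 else 0.0


class RewardModel:
    """Running mean of the true rewards observed for each (state, action)."""

    def __init__(self):
        self.total = defaultdict(float)
        self.count = defaultdict(int)

    def update(self, state, action, reward):
        self.total[(state, action)] += reward
        self.count[(state, action)] += 1

    def predict(self, state, action):
        n = self.count[(state, action)]
        return self.total[(state, action)] / n if n else 0.0

    def confidence(self, state, action):
        # Hypothetical count-based confidence, not the paper's measure:
        # 0 with no observations, approaching 1 as observations accumulate.
        n = self.count[(state, action)]
        return n / (n + 1)


def train(episodes=200, threshold=0.75, alpha=0.5, gamma=0.9,
          epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)
    model = RewardModel()
    requested = 0  # number of true (costly) rewards queried
    steps = 0      # total environment steps taken
    for _ in range(episodes):
        state = 0
        for _ in range(50):  # cap on episode length
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                # Greedy action with random tie-breaking.
                best = max(q[(state, a)] for a in ACTIONS)
                action = rng.choice(
                    [a for a in ACTIONS if q[(state, a)] == best])
            step = 1 if action == 1 else -1
            next_state = max(0, min(GOAL, state + step))

            if model.confidence(state, action) < threshold:
                # Low confidence: pay for the true reward and learn from it.
                reward = true_reward(state, action)
                model.update(state, action, reward)
                requested += 1
            else:
                # High confidence: use the model's prediction as a proxy.
                reward = model.predict(state, action)

            done = next_state == GOAL
            future = 0.0 if done else max(
                q[(next_state, a)] for a in ACTIONS)
            target = reward + gamma * future
            q[(state, action)] += alpha * (target - q[(state, action)])
            steps += 1
            state = next_state
            if done:
                break
    return q, requested, steps


if __name__ == "__main__":
    q, requested, steps = train()
    print(f"true rewards requested: {requested} of {steps} steps")
```

With `threshold=0.75`, the count-based confidence stops triggering requests once a state-action pair has been queried three times, so this toy run asks for at most 24 true rewards regardless of how many steps it takes. Because the toy reward is deterministic, the proxy is exact after a single query; a noisy or human-delivered reward would need a confidence measure that accounts for variance.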
