Model-free Policy Learning with Reward Gradients

Qingfeng Lan; Samuele Tosatto; Homayoon Farrahi; A. Rupam Mahmood

arXiv:2103.05147·cs.LG·November 3, 2023

TL;DR

This paper introduces Reward Policy Gradient, a model-free method that leverages reward gradients to improve sample efficiency in policy learning, especially in data-scarce settings like robotics.

Contribution

It proposes a novel Reward Policy Gradient estimator that uses reward gradients without requiring environment models, enhancing sample efficiency and policy performance.

Findings

01 Achieves better bias-variance trade-off in reward gradient estimation.

02 Improves sample efficiency over traditional methods.

03 Boosts PPO performance on MuJoCo tasks.

Abstract

Despite the increasing popularity of policy gradient methods, they are yet to be widely utilized in sample-scarce applications, such as robotics. The sample efficiency could be improved by making best usage of available information. As a key component in reinforcement learning, the reward function is usually devised carefully to guide the agent. Hence, the reward function is usually known, allowing access to not only scalar reward signals but also reward gradients. To benefit from reward gradients, previous works require the knowledge of environment dynamics, which are hard to obtain. In this work, we develop the Reward Policy Gradient estimator, a novel approach that integrates reward gradients without learning a model. Bypassing the model dynamics allows our estimator to achieve a better bias-variance trade-off, which results in a higher sample efficiency, as shown in the…
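
The point that a known reward function exposes gradients, and not only scalar values, can be illustrated with a toy sketch. The quadratic reward, the one-dimensional reparameterized Gaussian policy, and every function name below are assumptions made for illustration. This is not the Reward Policy Gradient estimator from the paper; it only shows what a reward gradient with respect to a policy parameter looks like, checked against a finite-difference estimate.

```python
import random


def reward(state, action):
    # Hypothetical known reward: penalize the squared gap between action and state.
    return -(action - state) ** 2


def reward_grad_action(state, action):
    # Analytic derivative of the hypothetical reward with respect to the action.
    return -2.0 * (action - state)


def sample_action(mu, sigma, eps):
    # Reparameterized Gaussian policy (assumed): a = mu + sigma * eps, eps ~ N(0, 1).
    return mu + sigma * eps


def reward_grad_mu(state, mu, sigma, eps):
    # Chain rule: dr/dmu = dr/da * da/dmu, where da/dmu = 1 for this policy.
    action = sample_action(mu, sigma, eps)
    return reward_grad_action(state, action) * 1.0


def finite_diff_grad_mu(state, mu, sigma, eps, h=1e-5):
    # Central finite difference using only scalar reward evaluations.
    up = reward(state, sample_action(mu + h, sigma, eps))
    down = reward(state, sample_action(mu - h, sigma, eps))
    return (up - down) / (2.0 * h)


rng = random.Random(0)  # fixed seed so the noise sample is reproducible
state, mu, sigma = 1.0, 0.2, 0.5
eps = rng.gauss(0.0, 1.0)

analytic = reward_grad_mu(state, mu, sigma, eps)
numeric = finite_diff_grad_mu(state, mu, sigma, eps)
```

Here `analytic` uses the reward's derivative directly, while `numeric` recovers the same quantity from scalar rewards alone at the cost of two extra evaluations per parameter. No environment dynamics appear anywhere in the sketch, because only the immediate reward is differentiated.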

Code & Models

Repositories

qlan3/Explorer (PyTorch, official)

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adaptive Dynamic Programming Control