$f$-Policy Gradients: A General Framework for Goal Conditioned RL using $f$-Divergences

Siddhant Agarwal; Ishan Durugkar; Peter Stone; Amy Zhang

arXiv:2310.06794 · cs.LG · October 11, 2023 · 1 citation

TL;DR

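This paper introduces $f$-Policy Gradients ($f$-PG), a framework that minimizes an $f$-divergence between the agent's state visitation distribution and the goal distribution, providing dense learning signals for exploration and policy optimization in goal-conditioned RL with sparse rewards. It also offers a unified way to study metric-based shaping rewards.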

Contribution

The paper proposes $f$-Policy Gradients, a method that minimizes an $f$-divergence between the agent's state visitation distribution and the goal distribution, yielding dense learning signals and better exploration in sparse-reward environments.
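
For reference, the standard definition of an $f$-divergence between two distributions $P$ and $Q$ is given below, followed by a sketch of the objective described above. The symbols $p_\theta$ (state visitation distribution of the policy with parameters $\theta$) and $p_g$ (goal distribution), and the argument order inside $D_f$, are notational assumptions made here for illustration and are not taken from this page.

$$
D_f(P \,\|\, Q) = \mathbb{E}_{x \sim Q}\left[ f\left( \frac{P(x)}{Q(x)} \right) \right], \qquad f \text{ convex}, \quad f(1) = 0
$$

$$
\min_{\theta} \; D_f\left( p_\theta \,\|\, p_g \right)
$$

Choosing $f(t) = t \log t$ recovers the KL divergence $\mathrm{KL}(P \,\|\, Q)$, and $f(t) = \tfrac{1}{2}|t - 1|$ recovers the total variation distance.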

Findings

01. $f$-PG outperforms standard policy gradient methods on gridworld, Point Maze, and FetchReach environments.

02. Introduces $s$-MaxEnt RL (state-MaxEnt RL), an entropy-regularized policy optimization objective that can be used with metric-based shaping rewards.

03. Provides a unified framework for using metric-based shaping rewards with efficient exploration.

Abstract

Goal-Conditioned Reinforcement Learning (RL) problems often have access to sparse rewards where the agent receives a reward signal only when it has achieved the goal, making policy optimization a difficult problem. Several works augment this sparse reward with a learned dense reward function, but this can lead to sub-optimal policies if the reward is misaligned. Moreover, recent works have demonstrated that effective shaping rewards for a particular problem can depend on the underlying learning algorithm. This paper introduces a novel way to encourage exploration called $f$-Policy Gradients, or $f$-PG. $f$-PG minimizes the $f$-divergence between the agent's state visitation distribution and the goal, which we show can lead to an optimal policy. We derive gradients for various $f$-divergences to optimize this objective. Our learning paradigm provides dense learning signals for exploration in…
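
To make the quantity in the abstract concrete, here is a minimal sketch that evaluates a few $f$-divergences between a discrete state visitation distribution and a goal distribution. It is not the paper's implementation: the 5-state chain, the two visitation distributions, the `smooth` helper, and all function names are hypothetical, and the smoothing step is an illustrative choice made only to keep the probability ratios finite.

```python
import math


def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)) for discrete P, Q with q(x) > 0."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))


# Generator functions: each is convex with f(1) = 0.
def f_kl(t):
    # f(t) = t log t gives KL(P || Q); the t -> 0 limit is 0.
    return t * math.log(t) if t > 0 else 0.0


def f_reverse_kl(t):
    # f(t) = -log t gives KL(Q || P); requires t > 0.
    return -math.log(t)


def f_tv(t):
    # f(t) = |t - 1| / 2 gives the total variation distance.
    return 0.5 * abs(t - 1.0)


def smooth(dist, eps=1e-3):
    """Mix with a uniform distribution so every state has nonzero mass (assumption)."""
    n = len(dist)
    return [(1.0 - eps) * d + eps / n for d in dist]


# Hypothetical 5-state chain whose goal is the last state.
goal = smooth([0.0, 0.0, 0.0, 0.0, 1.0])
far_policy = smooth([0.7, 0.2, 0.1, 0.0, 0.0])   # visits states far from the goal
near_policy = smooth([0.1, 0.1, 0.2, 0.3, 0.3])  # visits states near the goal

for name, f in [("KL", f_kl), ("reverse KL", f_reverse_kl), ("TV", f_tv)]:
    far = f_divergence(far_policy, goal, f)
    near = f_divergence(near_policy, goal, f)
    print(f"{name}: far={far:.3f} near={near:.3f}")
```

In this toy setup every divergence is smaller for the visitation distribution that places more mass on the goal, which is the sense in which such an objective can give a graded signal even before the goal is reached reliably.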

Taxonomy

Topics: Reinforcement Learning in Robotics · Age of Information Optimization · Advanced Bandit Algorithms Research