Off-Policy Maximum Entropy RL with Future State and Action Visitation Measures
Adrien Bolland, Gaspard Lambrechts, Damien Ernst

TL;DR
This paper introduces an off-policy maximum entropy reinforcement learning method whose intrinsic reward is the relative entropy of the discounted distribution of states and actions visited in future time steps, with the aim of improving exploration and control performance.
Contribution
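The paper proposes a new intrinsic reward based on the relative entropy of the discounted future state-action visitation distribution. Because this distribution is the fixed point of a contraction operator, it can be learned off-policy with adapted existing algorithms and then used to compute the intrinsic rewards.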
Findings
Policies achieve high state-action coverage
Method improves exploration efficiency
Results show strong control performance
Abstract
Maximum entropy reinforcement learning integrates exploration into policy learning by providing additional intrinsic rewards proportional to the entropy of some distribution. In this paper, we propose a novel approach in which the intrinsic reward function is the relative entropy of the discounted distribution of states and actions (or features derived from these states and actions) visited during future time steps. This approach is motivated by two results. First, a policy maximizing the expected discounted sum of intrinsic rewards also maximizes a lower bound on the state-action value function of the decision process. Second, the distribution used in the intrinsic reward definition is the fixed point of a contraction operator. Existing algorithms can therefore be adapted to learn this fixed point off-policy and to compute the intrinsic rewards. We finally introduce an algorithm…
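The fixed-point property described in the abstract can be illustrated in a small tabular setting. The sketch below is not the paper's algorithm: it assumes a known transition kernel `P` and policy `pi`, assumes the discounted future visitation measure takes the common form d(s', a' | s, a) = (1 - γ) Σ_t γ^t Pr(s_{t+1} = s', a_{t+1} = a' | s, a), and assumes the intrinsic reward is the negative relative entropy between this measure and a chosen target distribution (uniform in the test case). All function names are hypothetical. Under these assumptions, d satisfies d = (1 - γ) P_π + γ P_π d, and iterating this operator, which is a γ-contraction in the sup norm, converges to the measure.

```python
import numpy as np


def state_action_transition(P, pi):
    """Build the state-action transition matrix P_pi (assumed tabular setting).

    P:  array of shape (S, A, S), P[s, a, s2] = Pr(s2 | s, a).
    pi: array of shape (S, A), pi[s, a] = Pr(a | s).
    Returns an (S*A, S*A) matrix with entry P[s, a, s2] * pi[s2, a2].
    """
    S, A, _ = P.shape
    return (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)


def visitation_fixed_point(P_pi, gamma, tol=1e-10, max_iter=10_000):
    """Iterate d <- (1 - gamma) * P_pi + gamma * P_pi @ d until convergence.

    Row i of the result is the assumed discounted distribution over future
    state-action pairs, conditioned on starting from state-action pair i.
    """
    n = P_pi.shape[0]
    d = np.full((n, n), 1.0 / n)  # any row-stochastic initial guess works
    for _ in range(max_iter):
        d_next = (1.0 - gamma) * P_pi + gamma * P_pi @ d
        if np.max(np.abs(d_next - d)) < tol:
            return d_next
        d = d_next
    return d


def intrinsic_reward(d, target, eps=1e-12):
    """Negative KL divergence between each row of d and a target distribution.

    The choice of target (e.g. uniform) is an assumption of this sketch.
    """
    return -np.sum(d * (np.log(d + eps) - np.log(target + eps)), axis=1)
```

In this model-based sketch the operator is applied exactly; an off-policy learner would instead approximate the same fixed point from sampled transitions, which is the setting the abstract refers to.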
Taxonomy
Topics: Advancements in Semiconductor Devices and Circuit Design
