Maximum-Entropy Exploration with Future State-Action Visitation Measures
Adrien Bolland, Gaspard Lambrechts, Damien Ernst

TL;DR
This paper introduces a new intrinsic reward based on the entropy of future state-action features, which improves exploration efficiency and convergence speed in reinforcement learning, while maintaining comparable control performance.
Contribution
It proposes a novel entropy-based intrinsic reward derived from future state-action features, with theoretical guarantees and off-policy estimability, enhancing exploration in reinforcement learning.
Findings
Improved feature visitation within trajectories.
Faster convergence for exploration agents.
Similar control performance across benchmarks.
Abstract
Maximum entropy reinforcement learning motivates agents to explore states and actions to maximize the entropy of some distribution, typically by providing additional intrinsic rewards proportional to that entropy function. In this paper, we study intrinsic rewards proportional to the entropy of the discounted distribution of state-action features visited during future time steps. This approach is motivated by two results. First, we show that the expected sum of these intrinsic rewards is a lower bound on the entropy of the discounted distribution of state-action features visited in trajectories starting from the initial states, which we relate to an alternative maximum entropy objective. Second, we show that the distribution used in the intrinsic reward definition is the fixed point of a contraction operator and can therefore be estimated off-policy. Experiments highlight that the new…
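The fixed-point claim in the abstract can be illustrated on a small tabular problem. The sketch below is not the authors' implementation. It assumes a finite MDP, a discrete feature map h(s, a), and a recursion of the form d(z | s, a) = E[(1 - gamma) 1{h(s', a') = z} + gamma d(z | s', a')], which is one plausible reading of "discounted distribution of state-action features visited during future time steps". The names `apply_operator`, `future_feature_distribution`, and `entropy`, and the random MDP itself, are hypothetical and chosen only for illustration.

```python
import numpy as np


def apply_operator(d, P, pi, onehot, gamma):
    """One application of the assumed operator T to d, with shapes (S, A, Z)."""
    # Bootstrapped target: feature of the next pair plus discounted future mass.
    target = (1.0 - gamma) * onehot + gamma * d
    # Average over the next action a' ~ pi(. | s').
    next_value = np.einsum("ta,taz->tz", pi, target)
    # Average over the next state s' ~ P(. | s, a).
    return np.einsum("sat,tz->saz", P, next_value)


def future_feature_distribution(P, pi, h, n_features, gamma=0.9,
                                tol=1e-10, max_iter=10_000):
    """Iterate T from a uniform start until successive iterates agree."""
    n_states, n_actions, _ = P.shape
    onehot = np.eye(n_features)[h]  # onehot[s, a, z] = 1 if h[s, a] == z
    d = np.full((n_states, n_actions, n_features), 1.0 / n_features)
    for _ in range(max_iter):
        d_new = apply_operator(d, P, pi, onehot, gamma)
        converged = np.max(np.abs(d_new - d)) < tol
        d = d_new
        if converged:
            break
    return d


def entropy(p, eps=1e-12):
    """Shannon entropy along the last axis, in nats."""
    return -np.sum(p * np.log(p + eps), axis=-1)


# Hypothetical random MDP: 4 states, 2 actions, 3 feature values.
rng = np.random.default_rng(0)
S, A, Z, gamma = 4, 2, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s'], rows sum to 1
pi = rng.dirichlet(np.ones(A), size=S)       # pi[s, a], rows sum to 1
h = rng.integers(0, Z, size=(S, A))          # discrete feature index of (s, a)

d = future_feature_distribution(P, pi, h, Z, gamma)
r_int = entropy(d)                           # one intrinsic reward per (s, a)
```

Because the transition model `P` and the policy `pi` enter only through one-step expectations, the same target can in principle be regressed from sampled transitions collected by another policy, which is the sense in which a fixed point of this kind is estimable off-policy. The uniform initialization and the sup-norm stopping rule are design choices of this sketch, not details taken from the paper.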
Peer Reviews
Decision · Submitted to ICLR 2026
Integrating visitation distributions into MaxEntRL is an interesting idea. There are theoretical results (e.g., contractive properties and KL lower bounds) and numerical experiments on MiniGrid environments.
1. The function $h$ is central but under-specified. How to choose or parameterize it for high-dimensional states is unclear. 2. Only small, discrete MiniGrid environments are tested. No continuous-control tasks are considered. The current experiments rely on hand-crafted discrete features (agent position), limiting generality. 3. There is no intuitive explanation of why the lower bound of Th. 3.2 is meaningful for exploration or what properties it preserves. What is L in this theorem? It would be
- The paper provides an elegant unification of different "MaxEntRL" approaches by formalizing the intrinsic reward with a separate feature space, which can alternatively be the action space for standard action entropy incentives or the state space for state entropy exploration; - The paper provides an original version of the state entropy exploration objective that only looks at the entropy of the future discounted state (or state-action) distribution conditioned on the current state; - The pape
- The paper seems to mischaracterize existing state entropy algorithms as inherently on-policy. While several of the existing implementations are used on-policy, they can be easily adapted to work off-policy; - After reading the paper and looking at experimental results, why conditional entropy should be preferred to marginal entropy is largely unclear to me; - The paper does not do much to clarify why MiniGrid has been chosen to compare the performance of conditional visitation entropy with prior w
The authors provide a theoretical analysis for the proposed method.
The paper is poorly written and contains unclear language and disorganized structure. 1. **Use of Eq (4) for Intrinsic Reward Formulation:** Why is Eq (4) chosen as the intrinsic reward formulation? How can this formulation be equivalent to the MaxEntRL or the particle-based entropy estimation method (Hazan et al., 2019) mentioned in line 14? 2. **Explanation of Variables and Distributions:** The explanation of variables and distributions is inadequate. What do \( \bar{s} \) and \( \
Taxonomy
Topics: Reinforcement Learning in Robotics · Robot Manipulation and Learning · Adaptive Dynamic Programming Control
