SEMDICE: Off-policy State Entropy Maximization via Stationary Distribution Correction Estimation

Jongmin Lee; Meiqi Sun; Pieter Abbeel

arXiv:2512.10042·cs.LG·December 12, 2025

Open Access · 3 Reviews

TL;DR

SEMDICE is an off-policy reinforcement learning algorithm that maximizes state entropy by directly estimating stationary distribution corrections, leading to improved unsupervised pre-training for downstream tasks.

Contribution

It introduces a novel off-policy method that computes a stationary distribution correction to maximize state entropy directly from arbitrary datasets.

Findings

01

Outperforms baseline algorithms in state entropy maximization

02

Achieves superior adaptation efficiency in downstream tasks

03

Effective in unsupervised RL pre-training scenarios

Abstract

In unsupervised pre-training for reinforcement learning, the agent aims to learn a prior policy for downstream tasks without relying on task-specific reward functions. We focus on state entropy maximization (SEM), where the goal is to learn a policy that maximizes the entropy of the state stationary distribution. In this paper, we introduce SEMDICE, a principled off-policy algorithm that optimizes the policy directly within the space of stationary distributions. SEMDICE computes a single, stationary Markov state-entropy-maximizing policy from an arbitrary off-policy dataset. Experimental results demonstrate that SEMDICE outperforms baseline algorithms in maximizing state entropy while achieving the best adaptation efficiency for downstream tasks among SEM-based unsupervised RL pre-training methods.
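To make the SEM objective concrete, here is a minimal sketch, not the authors' implementation. It assumes a tabular MDP and the discounted, normalized state stationary distribution commonly used in DICE-style methods, $d^\pi(s) = (1-\gamma)\sum_t \gamma^t \Pr(s_t = s)$; the abstract does not state the exact definition. The 3-state ring MDP and all names (`state_stationary_distribution`, `state_entropy`, `P`, `pi`, `p0`) are hypothetical.

```python
import numpy as np

def state_stationary_distribution(P, pi, p0, gamma=0.95):
    """Discounted state stationary distribution of policy pi (assumed definition).

    P:  [S, A, S] transition probabilities P[s, a, s']
    pi: [S, A] policy probabilities pi[s, a]
    p0: [S] initial state distribution
    """
    # State-to-state transitions under pi: P_pi[s, s'] = sum_a pi[s, a] * P[s, a, s']
    P_pi = np.einsum("sa,sat->st", pi, P)
    n_states = P.shape[0]
    # Solve the flow equation d = (1 - gamma) * p0 + gamma * P_pi^T d
    A = np.eye(n_states) - gamma * P_pi.T
    return np.linalg.solve(A, (1.0 - gamma) * p0)

def state_entropy(d, eps=1e-12):
    """Shannon entropy of a state distribution (eps guards log(0))."""
    return float(-np.sum(d * np.log(d + eps)))

# Hypothetical 3-state ring: action 0 stays, action 1 moves to the next state.
n_states, n_actions = 3, 2
P = np.zeros((n_states, n_actions, n_states))
for s in range(n_states):
    P[s, 0, s] = 1.0
    P[s, 1, (s + 1) % n_states] = 1.0

p0 = np.array([1.0, 0.0, 0.0])                 # always start in state 0
stay = np.tile([1.0, 0.0], (n_states, 1))      # never leaves state 0
uniform = np.full((n_states, n_actions), 0.5)  # moves half of the time

d_stay = state_stationary_distribution(P, stay, p0)
d_uniform = state_stationary_distribution(P, uniform, p0)
h_stay = state_entropy(d_stay)
h_uniform = state_entropy(d_uniform)
```

In this toy example the policy that never moves concentrates all mass on the start state and has near-zero state entropy, while the policy that sometimes moves spreads its visitation and scores higher. An SEM method searches for the policy whose visitation has the largest such entropy.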

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

* The proposed algorithm SEMDICE is a novel method suitable for off-policy training.
* SEMDICE demonstrates efficient convergence in tabular MDP experiments and outperforms existing methods in RL policy pre-training.
* Theoretical analysis is provided to demonstrate the efficacy of SEMDICE.

Weaknesses

* The hyperparameter $\alpha$ may have a significant impact on the results.
* Though experiments in tabular RL and continuous state spaces are conducted, the paper does not address high-dimensional data (such as image-based tasks), which may pose computational complexity issues. This could be a potential direction for future improvements.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The motivation is sound. The particle entropy method operates off-policy but optimizes the entropy of the replay buffer rather than that of the target policy. In contrast, the method proposed by Hazan et al. (2019) is on-policy but sample-inefficient. Using DICE-like approaches can address these issues. While the original DICE framework focuses on Linear MDPs (where the objective function is linear with respect to the state visitation distribution), this paper examines Convex MDPs (where state entropy i

Weaknesses

1. The work of Hazan et al. (2019) is a primary reference, yet the paper lacks a comparison with its experiments, in particular the MountainCar experiment.
2. The implementation details are vague. For instance, in Equation 56 of the Appendix, the paper estimates $-\log d^D(s)$ using a k-nearest neighbors (kNN) based particle entropy estimation, which introduces bias and contradicts the paper's motivation. This needs clarification and should be discussed in the main text.
3. For readers unf
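The kNN-based estimate mentioned in point 2 can be illustrated roughly as follows. This is a generic Kozachenko-Leonenko-style particle estimate with additive constants dropped, not the paper's Equation 56, which is not reproduced on this page. The function name `knn_neg_log_density`, the choice `k=5`, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def knn_neg_log_density(states, k=5):
    """Estimate -log d(s) up to an additive constant from kNN distances.

    A kNN density estimate is proportional to k / (N * r_k^dim), where r_k is
    the distance to the k-th nearest neighbour, so -log d(s) ~ dim * log(r_k).
    """
    # Pairwise Euclidean distances, shape [N, N]
    diffs = states[:, None, :] - states[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # After sorting each row, index 0 is the point itself (distance 0),
    # so index k is the distance to the k-th nearest neighbour.
    r_k = np.sort(dists, axis=1)[:, k]
    dim = states.shape[1]
    return dim * np.log(r_k + 1e-8)

# Synthetic data: a dense cluster near the origin plus one isolated state.
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
outlier = np.array([[5.0, 5.0]])
states = np.concatenate([cluster, outlier], axis=0)

scores = knn_neg_log_density(states, k=5)
```

The isolated state receives the largest score, which is why such estimates are used as novelty signals. Because the estimate depends on `k`, the sample size, and the distance metric, it is biased for finite data, which is the concern raised in the review.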

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. This paper addresses the limitation of existing SEM methods in their inability to support off-policy learning, introducing the objective of optimizing the state distribution. By integrating optimization techniques from the DICE family of methods, the authors propose a well-motivated and logically consistent algorithm.
2. The mathematical derivations in this paper are both accurate and well-founded. Specifically, this paper provides a detailed explanation of why it is necessary to introduce

Weaknesses

1. I think there may be an oversight in the derivation. Specifically, in Equation (15), the summation over $s$ is replaced by sampling $s$ from $D$ through importance sampling, which overlooks cases where $s$ is not in $D$. Therefore, if $s$ is severely out-of-distribution (OOD), the calculated $\log \bar{d}(s)$ may be highly inaccurate. Consequently, even though SEMDICE is off-policy, it still introduces a significant bias. The authors might consider adding experiments to demonstrate
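The coverage concern can be illustrated with a small numeric sketch. It is not the paper's Equation (15), which does not appear on this page; it only shows the generic identity $\sum_s d(s) f(s) = \mathbb{E}_{s \sim d^D}\left[\frac{d(s)}{d^D(s)} f(s)\right]$, which holds only when $d^D(s) > 0$ wherever $d(s) f(s) \neq 0$. The function `importance_weighted_sum` and all numbers are hypothetical.

```python
import numpy as np

def importance_weighted_sum(d, d_D, f):
    """Expectation under d_D of (d / d_D) * f, using only states with d_D > 0."""
    covered = d_D > 0
    weights = d[covered] / d_D[covered]
    return float(np.sum(d_D[covered] * weights * f[covered]))

d = np.array([0.5, 0.3, 0.2])   # target state distribution
f = np.array([1.0, 2.0, 3.0])   # arbitrary per-state quantity

exact = float(np.sum(d * f))    # direct summation over all states

# Dataset distribution covering every state: the identity holds.
full = importance_weighted_sum(d, np.array([0.2, 0.5, 0.3]), f)

# Dataset distribution that never visits state 2: its contribution is lost.
partial = importance_weighted_sum(d, np.array([0.4, 0.6, 0.0]), f)
missing = exact - partial       # equals d[2] * f[2]
```

With full coverage the reweighted value matches the direct sum. When the dataset never contains a state, that state's term silently drops out, so the gap equals the uncovered mass times its value.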

Taxonomy

Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification