Maximum Entropy On-Policy Actor-Critic via Entropy Advantage Estimation

Jean Seong Bjorn Choe; Jong-Kook Kim

arXiv:2407.18143·cs.LG·July 26, 2024


Open Access · 4 Reviews

TL;DR

This paper introduces a simple method to incorporate maximum entropy reinforcement learning into on-policy actor-critic algorithms, improving policy performance and generalisation in complex tasks.

Contribution

It proposes a novel approach to separate the entropy objective from the main objective, enabling effective application of MaxEnt RL in on-policy settings.
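As a rough illustration of what "separating" the objectives can mean, the discounted MaxEnt RL objective splits by linearity of expectation into a return term and an entropy term. The notation here (temperature $\tau$, discount $\gamma$, policy $\pi$, entropy $\mathcal{H}$) is assumed for illustration and is not taken from the paper:

```latex
\begin{aligned}
J_{\text{MaxEnt}}(\pi)
  &= \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}
     \Big( r(s_t, a_t) + \tau\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big)\right] \\
  &= \underbrace{\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]}_{\text{return objective}}
   \;+\; \tau\, \underbrace{\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\,
     \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\right]}_{\text{entropy objective}}
\end{aligned}
```

Under this reading, each term can be estimated on its own and the two recombined with the weight $\tau$.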

Findings

01. Extending PPO and TRPO with MaxEnt RL improves performance.

02. MaxEnt RL enhances policy generalisation.

03. Empirical results on MuJoCo and Procgen tasks support effectiveness.

Abstract

Entropy Regularisation is a widely adopted technique that enhances policy optimisation performance and stability. A notable form of entropy regularisation is augmenting the objective with an entropy term, thereby simultaneously optimising the expected return and the entropy. This framework, known as maximum entropy reinforcement learning (MaxEnt RL), has shown theoretical and empirical successes. However, its practical application in straightforward on-policy actor-critic settings remains surprisingly underexplored. We hypothesise that this is due to the difficulty of managing the entropy reward in practice. This paper proposes a simple method of separating the entropy objective from the MaxEnt RL objective, which facilitates the implementation of MaxEnt RL in on-policy settings. Our empirical evaluations demonstrate that extending Proximal Policy Optimisation (PPO) and Trust Region…
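One way to picture the separation described in the abstract is to run advantage estimation twice, once on environment rewards and once on an entropy reward, and then add the two with a temperature weight. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the function names `gae` and `combined_advantage`, the use of `-log_prob` as the per-step entropy reward, the two separate value estimates, and all default coefficients are hypothetical choices made for this example.

```python
import numpy as np


def gae(rewards, values, last_value, gamma, lam):
    """Generalised Advantage Estimation for one trajectory segment.

    Assumes no episode boundary inside the segment; `last_value` is the
    bootstrap value of the state that follows the final step.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    adv = np.zeros(len(rewards))
    next_value, running = float(last_value), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running              # discounted sum of residuals
        adv[t] = running
        next_value = values[t]
    return adv


def combined_advantage(rewards, log_probs, v_reward, v_entropy,
                       last_v_reward, last_v_entropy, tau=0.01,
                       gamma=0.99, lam=0.95, gamma_ent=0.99, lam_ent=0.95):
    """Reward advantage plus tau times an entropy advantage (illustrative only).

    Assumption: -log pi(a_t | s_t) serves as a sampled per-step entropy
    reward, with its own value estimates and its own (gamma, lambda) pair.
    """
    entropy_rewards = -np.asarray(log_probs, dtype=float)
    adv_reward = gae(rewards, v_reward, last_v_reward, gamma, lam)
    adv_entropy = gae(entropy_rewards, v_entropy, last_v_entropy, gamma_ent, lam_ent)
    return adv_reward + tau * adv_entropy
```

Keeping the two estimates apart means the entropy term can use its own discount and GAE coefficients and is not mixed into the reward scale before weighting, which is the kind of control the reviews below discuss.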

Peer Reviews

Decision·ICLR 2024 Conference Withdrawn Submission

Reviewer 01 · Rating 3: reject, not good enough · Confidence 3

Strengths

- Novel idea with separated critics that makes it possible to deal with the different scaling of rewards and policy entropy;
- One of the first algorithms in the nearly empty niche of MaxEnt on-policy RL algorithms;

Weaknesses

The major weakness of the presented paper is the weak empirical validation of the proposed method: a lack of baseline comparisons and ablation studies.
- No experimental comparison to other on-policy methods such as TRPO and Mirror Descent Policy Optimization (MDPO);
- Lack of ablation studies: for example, the effect of different discounting and GAE coefficients for reward and entropy value is not studied properly;

Tomar, M., Shani, L., Efroni, Y., & Ghavamzadeh, M. (2020). Mirror descent policy opt

Reviewer 02 · Rating 3: reject, not good enough · Confidence 4

Strengths

1. There have been many approaches to improving the performance of on-policy algorithms by using entropy regularization to increase exploration and enhance stability. There has also been a significant body of research on using the MaxEnt RL framework, which increases the expected trajectory entropy in off-policy algorithms like SAC to improve performance. However, the use of MaxEnt RL in on-policy algorithms has been underexplored. Through research that combines PPO with MaxEnt RL, a new algorit

Weaknesses

1. Approaches like the soft advantage function and GAE have been explored extensively in previous off-policy MaxEnt RL research, and there have been studies combining on-policy RL with the MaxEnt RL framework. Since simply adding an entropy reward to PPO and the proposed algorithm appear to perform similarly in the experiments, additional contributions would be valuable.
2. The direction of integrating MaxEnt RL into PPO, which traditionally used entropy regularization

Reviewer 03 · Rating 3: reject, not good enough · Confidence 4

Strengths

Changing the type of entropy term might help performance in practice.

Weaknesses

This paper reads more like a proposal or an experimental report than a finished paper.
1. Poor writing and limited literature review in the Introduction and Related Works.
2. Proposing a practical trick is fine for a paper, but it must come with abundant theoretical and empirical support.
* Lack of theoretical support for the proposed trick.
* Marginal performance improvement compared to PPO in MuJoCo tasks.
* Only implements the trick on PPO but claims its generalizability.
* Few results

Reviewer 04 · Rating 3: reject, not good enough · Confidence 4

Strengths

1. Extending MaxEnt RL to on-policy algorithms seems to be a reasonable idea.
2. EAPO is heuristic but makes sense.
3. EAPO is empirically shown to outperform variants of PPO with entropy regularization.

Weaknesses

1. It is not 100% clear to me why MaxEnt RL should be studied in the on-policy context. The paper motivates it simply by saying some tasks are more suited for on-policy RL, with some superficial examples at the end of the Introduction, but it fails to show the actual advantage of on-policy MaxEnt RL over off-policy MaxEnt RL from either a theoretical or an empirical point of view.
2. The key technical contribution is the separate advantage computation for the value and entropy, which seems to be a

Taxonomy

Topics: Reinforcement Learning in Robotics