Mirror Descent Actor Critic via Bounded Advantage Learning

Ryo Iwaki

arXiv:2502.03854·cs.LG·January 9, 2026

Open Access 3 Reviews

TL;DR

This paper introduces Mirror Descent Actor Critic (MDAC), an actor-critic reinforcement learning method for continuous action domains. Its performance improves when the advantage estimates are bounded, a claim the paper supports with both theoretical analysis and empirical validation.

Contribution

The paper proposes MDAC, an actor-critic algorithm that enhances regularized RL by bounding advantage estimates, and provides theoretical and empirical analysis of its benefits.

Findings

01

With proper bounding, MDAC outperforms non-regularized and entropy-only methods.

02

Bounding advantage terms improves empirical performance in continuous domains.

03

Theoretical analysis supports the use of bounded advantage in regularized RL.

Abstract

Regularization is a core component of recent Reinforcement Learning (RL) algorithms. Mirror Descent Value Iteration (MDVI) uses both Kullback-Leibler divergence and entropy as regularizers in its value and policy updates. Despite its empirical success in discrete action domains and strong theoretical guarantees, the performance of KL-entropy-regularized methods does not surpass that of a strong entropy-only-regularized method in continuous action domains. In this study, we propose Mirror Descent Actor Critic (MDAC) as an actor-critic style instantiation of MDVI for continuous action domains, and show that its empirical performance is significantly boosted by bounding the actor's log-density terms in the critic's loss function, compared to a non-bounded naive instantiation. Further, we relate MDAC to Advantage Learning by recalling that the actor's log-probability is equal to the…
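
To make the bounding idea concrete, here is a minimal sketch of a one-sample critic target in which the actor's log-density terms pass through a bounding function before entering the target. It is an illustration under stated assumptions, not the paper's loss: the Munchausen-style form of the target, the function name `critic_target`, and the coefficients `kl_coef` and `ent_coef` are all hypothetical. The choice of `tanh` as the bound and the identity as the naive variant follows the comparison mentioned in the reviews.

```python
import math


def identity(x):
    """Naive (non-bounded) variant: the log-density is used as-is."""
    return x


def critic_target(reward, q_next, log_pi, log_pi_next,
                  gamma=0.99, kl_coef=0.9, ent_coef=0.2, bound=math.tanh):
    """Hypothetical one-sample critic target with bounded log-density terms.

    The overall form and the coefficient names are assumptions made for
    illustration; they are not taken from the paper.
    """
    # Log-density of the action actually taken, passed through `bound`.
    bonus = kl_coef * bound(log_pi)
    # Entropy-style correction on the next action, also passed through `bound`.
    soft_next = q_next - ent_coef * bound(log_pi_next)
    return reward + bonus + gamma * soft_next


# A very unlikely action under a continuous policy can have a large negative
# log-density, which dominates the naive target but not the bounded one.
naive = critic_target(1.0, 10.0, -50.0, -50.0, bound=identity)
bounded = critic_target(1.0, 10.0, -50.0, -50.0, bound=math.tanh)
print(naive, bounded)
```

Because `tanh` maps into [-1, 1], the bounded target can differ from `reward + gamma * q_next` by at most `kl_coef + gamma * ent_coef`, whereas the naive target moves linearly with the log-density, which is unbounded for continuous policies.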

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

* After applying the bound operation, the algorithm is empirically observed to converge faster and achieve higher scores.
* The clip operation is easy to implement.

Weaknesses

* The justification for a larger error tolerance in critic value estimation is not valid. Specifically, the paper argues that the **upper bound** on the proposed algorithm's error term (the last term in equation (10)) is lower than that of the baseline algorithm (lines 345-350). However, a lower upper bound does not imply that the term itself is lower. Given that the main motivation for the algorithm is its better error tolerance, a rigorous justification is critical.
* The writing needs to be polished; the p

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The strengths of the paper include its originality, clarity, and significance:

1. The originality of the paper is one of its strengths. Although the technique of bounding the log density term in the off-policy case is ad-hoc and not entirely novel (as the original Munchausen RL also has a similar variant), the theoretical results are new to my knowledge.
2. The paper is mostly clear. It has clear writing and is easy to follow. It also covers most of the important related works.
3. The studied pr

Weaknesses

Despite having many strengths, this paper is weak in the following areas:

1. The empirical evaluation is limited. The main empirical results only include experiments on six MuJoCo environments and show only marginal improvements over the baseline. There are many other commonly used continuous control benchmarks (e.g., DeepMind Control Suite and Omniverse Isaac Gym environments); including some experiments on these environments would strengthen the paper, especially if there is a larger imp

Reviewer 03 · Rating 6 · Confidence 3

Strengths

**Good motivation** - Improvement of the performance of Mirror Descent RL on continuous action tasks.

**Theoretical and Empirical Rigour** - The progression from MDVI to MDAC is well-motivated, and the authors thoughtfully discuss the relation to SAC’s temperature tuning. I found the performance gap between non-bounded (identity) and bounded (tanh) log-policy to be surprisingly substantial.

Weaknesses

**Section 4**

- It seems that bounding the log-policy was meant to yield a tighter bound in Theorem 3, connecting it to Advantage Learning (AL). Could you clarify if BAL was introduced specifically for this purpose?
- L224: I didn’t see definitions for regularized MDP and soft state value function. Could you include these?
- L227: Why does $V(s) = \max_{a \in A} Q(s, a)$ hold when $\alpha = 0$?
- L239 (Eq. 9): Could you provide the derivation for this Bellman operator?
- L245: Could you explain how the gap-increas
