Mirror descent actor-critic methods for entropy regularised MDPs in general spaces: stability and convergence

Denis Zorba; David Šiška; Lukasz Szpruch

arXiv:2602.10838·math.OC·February 12, 2026

TL;DR

This paper establishes stability and convergence guarantees for mirror descent actor-critic algorithms in entropy regularised MDPs with Polish state and action spaces, including conditions for sub-linear and linear convergence.

Contribution

It provides the first rigorous convergence analysis for policy mirror descent with inexact advantage functions in general state and action spaces, and introduces a variant that performs multiple TD steps per policy update.

Findings

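01 The single-loop actor-critic scheme is stable and convergent under rigorously derived sufficient conditions.

02 A variant that performs multiple TD steps per policy update weakens these conditions, with an explicit lower bound on the number of TD steps required to ensure stability.

03 Convergence is sub-linear when the number of TD steps grows logarithmically with the number of policy updates, and linear when it grows linearly under a concentrability assumption.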

Abstract

We provide theoretical guarantees for convergence of discrete-time policy mirror descent with inexact advantage functions updated using temporal difference (TD) learning for entropy regularised MDPs in Polish state and action spaces. We rigorously derive sufficient conditions under which the single-loop actor-critic scheme is stable and convergent. To weaken these conditions, we introduce a variant that performs multiple TD steps per policy update and derive an explicit lower bound on the number of TD steps required to ensure stability. Finally, we establish sub-linear convergence when the number of TD steps grows logarithmically with the number of policy updates, and linear convergence when it grows linearly under a concentrability assumption.
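
To make the scheme described in the abstract concrete, the following is a minimal illustrative sketch, not the authors' algorithm. It assumes a finite (tabular) MDP rather than Polish spaces, a reward-maximising convention, a known transition kernel so that each TD step is an expected TD(0) update on a Q-table (the paper works with advantage functions), and one common closed form of the KL mirror step for entropy regularised objectives. All names (`mirror_descent_actor_critic`, `td_steps`, `eta`, `alpha`, `tau`) are hypothetical and chosen for illustration only.

```python
import numpy as np

def log_softmax(x):
    # Row-wise log-softmax, shifted by the row maximum for numerical stability.
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def soft_value(Q, log_pi, tau):
    # Entropy regularised state value:
    # V(s) = sum_a pi(a|s) * (Q(s,a) - tau * log pi(a|s)).
    return (np.exp(log_pi) * (Q - tau * log_pi)).sum(axis=1)

def mirror_descent_actor_critic(P, r, gamma, tau, eta, alpha, td_steps, n_updates):
    """Tabular sketch (assumption: finite spaces, known kernel P[s, a, s'])."""
    n_states, n_actions = r.shape
    # Actor: start from the uniform policy, stored as log-probabilities.
    log_pi = np.full((n_states, n_actions), -np.log(n_actions))
    # Critic: Q-table, warm-started across policy updates.
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_updates):
        # Critic: td_steps expected TD(0) updates for the current policy.
        # td_steps = 1 corresponds to a single-loop scheme.
        for _ in range(td_steps):
            td_target = r + gamma * (P @ soft_value(Q, log_pi, tau))
            Q = Q + alpha * (td_target - Q)
        # Actor: KL mirror step using the inexact critic Q, i.e. the maximiser of
        # <Q, p> + tau * entropy(p) - (1/eta) * KL(p || pi), in closed form.
        log_pi = log_softmax((log_pi + eta * Q) / (1.0 + eta * tau))
    return np.exp(log_pi), Q

# Usage on a small random MDP (all values below are arbitrary illustrative choices).
rng = np.random.default_rng(0)
P_demo = rng.dirichlet(np.ones(4), size=(4, 3))   # shape (states, actions, next states)
r_demo = rng.uniform(size=(4, 3))
pi_demo, Q_demo = mirror_descent_actor_critic(
    P_demo, r_demo, gamma=0.9, tau=0.5, eta=1.0, alpha=0.5, td_steps=5, n_updates=200
)
print(pi_demo.sum(axis=1))  # each row of the policy is a probability distribution
```

In this sketch the parameter `td_steps` plays the role of the number of TD steps per policy update; the abstract's lower bound and growth schedules (logarithmic or linear in the number of policy updates) would correspond to choosing `td_steps` per iteration rather than holding it fixed as done here.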

Taxonomy

Topics: Adaptive Dynamic Programming Control · Reinforcement Learning in Robotics · Neural Networks and Reservoir Computing