StaQ it! Growing neural networks for Policy Mirror Descent

Alena Shilova; Alex Davey; Brahim Driss; Riad Akrour

arXiv:2506.13862 · cs.LG · June 18, 2025

TL;DR

This paper introduces StaQ, a reinforcement learning algorithm based on Policy Mirror Descent that keeps only the last M Q-functions in memory. It converges for a finite, sufficiently large M, is competitive with existing deep RL methods, and shows less performance oscillation.

Contribution

The paper proposes and analyzes StaQ, a practical PMD-like algorithm that retains only the last M Q-functions, guaranteeing convergence and stability in deep RL.

Findings

01. StaQ converges with a finite memory of Q-functions.

02. StaQ achieves competitive performance with existing deep RL algorithms.

03. StaQ exhibits less performance oscillation and increased stability.

Abstract

In Reinforcement Learning (RL), regularization has emerged as a popular tool both in theory and practice, typically based either on an entropy bonus or a Kullback-Leibler divergence that constrains successive policies. In practice, these approaches have been shown to improve exploration, robustness and stability, giving rise to popular Deep RL algorithms such as SAC and TRPO. Policy Mirror Descent (PMD) is a theoretical framework that solves this general regularized policy optimization problem; however, the closed-form solution involves the sum of all past Q-functions, which is intractable in practice. We propose and analyze PMD-like algorithms that only keep the last $M$ Q-functions in memory, and show that for finite and large enough $M$, a convergent algorithm can be derived, introducing no error in the policy update, unlike prior deep RL PMD implementations. StaQ, the resulting…
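To make the idea of "keeping only the last $M$ Q-functions" concrete, here is a minimal tabular sketch in Python. It is not the StaQ algorithm itself: the class name `FiniteMemoryPolicy`, the parameters `memory_size` and `step_size`, and the plain unweighted sum of stored Q-tables are all assumptions made for illustration, since the exact weighting used in the paper is not given in this excerpt. It only shows how a softmax policy can be built from a bounded stack of Q-functions in which the oldest entry is evicted when a new one arrives.

```python
from collections import deque

import numpy as np


class FiniteMemoryPolicy:
    """Softmax policy over the sum of the last M Q-tables (illustrative only)."""

    def __init__(self, n_states, n_actions, memory_size, step_size=1.0):
        # A deque with maxlen drops its oldest element once it is full.
        self.q_stack = deque(maxlen=memory_size)
        self.step_size = step_size  # hypothetical scaling of the summed Q-values
        self.shape = (n_states, n_actions)

    def push(self, q_table):
        """Store a new Q-table; evicts the oldest one if M are already stored."""
        q_table = np.asarray(q_table, dtype=float)
        if q_table.shape != self.shape:
            raise ValueError(f"expected shape {self.shape}, got {q_table.shape}")
        self.q_stack.append(q_table)

    def logits(self):
        """Scaled sum of the stored Q-tables (zeros if nothing is stored yet)."""
        if not self.q_stack:
            return np.zeros(self.shape)
        return self.step_size * np.stack(list(self.q_stack)).sum(axis=0)

    def probs(self):
        """Row-wise softmax of the logits, one distribution per state."""
        z = self.logits()
        z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)


if __name__ == "__main__":
    policy = FiniteMemoryPolicy(n_states=1, n_actions=2, memory_size=2)
    for q in ([[1.0, 0.0]], [[0.0, 2.0]], [[0.0, 1.0]]):
        policy.push(q)
    # Only the last two Q-tables remain in memory.
    print(len(policy.q_stack), policy.probs())
```

Memory and the cost of evaluating the policy stay bounded by `memory_size` regardless of how many updates have been performed, which is the practical motivation stated in the abstract for truncating the sum of past Q-functions.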

Taxonomy

Topics: Explainable Artificial Intelligence (XAI)

Methods: Trust Region Policy Optimization