Value Mirror Descent for Reinforcement Learning

Zhichao Jia; Guanghui Lan

arXiv:2604.06039 · math.OC · April 8, 2026

TL;DR

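This paper introduces value mirror descent (VMD), a value optimization method for reinforcement learning that integrates mirror descent into value iteration. VMD converges linearly when the transition kernel is known, and its stochastic variant SVMD attains near-optimal sample complexity through variance reduction, targeting the offline and simulated settings where value iteration-type methods are typically used.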

Contribution

The paper develops value mirror descent (VMD), a value optimization algorithm that integrates mirror descent into the classical value iteration framework, together with stochastic variants (SVMD) and convergence and sample complexity guarantees.
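For readers unfamiliar with the mirror descent ingredient, a minimal sketch of a generic mirror descent step may help. It is not the paper's VMD update, which this summary does not specify. It assumes a distribution `p` over actions, a hypothetical cost vector `g`, and the KL divergence as the Bregman divergence, in which case the step has the well-known multiplicative closed form.

```python
import math


def entropic_mirror_step(p, g, eta):
    """One mirror descent step on the probability simplex.

    Solves argmin_q <g, q> + (1/eta) * KL(q || p), whose closed form is
    q_i proportional to p_i * exp(-eta * g_i). Illustrative only: `g` is a
    hypothetical per-action cost vector, not a quantity from the paper.
    """
    shift = min(g)  # subtracting a constant leaves q unchanged and avoids overflow
    weights = [p_i * math.exp(-eta * (g_i - shift)) for p_i, g_i in zip(p, g)]
    total = sum(weights)
    return [w / total for w in weights]


# Start from the uniform distribution and repeatedly step against fixed costs.
p = [1.0 / 3, 1.0 / 3, 1.0 / 3]
g = [1.0, 0.5, 0.2]  # hypothetical costs; the third action is cheapest
for _ in range(50):
    p = entropic_mirror_step(p, g, eta=1.0)
```

With fixed costs the iterates concentrate on the cheapest action while remaining a valid distribution at every step. That is the practical appeal of the KL geometry over a Euclidean projection.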

Findings

1. VMD converges linearly in deterministic settings.

2. SVMD achieves near-optimal sample complexity with variance reduction.

3. Bounded Bregman divergence enables effective online learning.

Abstract

Value iteration-type methods have been extensively studied for computing a nearly optimal value function in reinforcement learning (RL). Under a generative sampling model, these methods can achieve sharper sample complexity than policy optimization approaches, particularly in their dependence on the discount factor. In practice, they are often employed for offline training or in simulated environments. In this paper, we consider discounted Markov decision processes with state space $S$, action space $A$, discount factor $\gamma \in (0, 1)$, and costs in $[0, 1]$. We introduce a novel value optimization method, termed value mirror descent (VMD), which integrates mirror descent from convex optimization into the classical value iteration framework. In the deterministic setting with known transition kernels, we show that VMD converges linearly. For the stochastic setting with a generative model, we…
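As background for the setting described above, the following is a minimal sketch of classical value iteration, the baseline framework that VMD modifies; it is not VMD itself. It assumes a small tabular MDP with a known transition kernel and costs in $[0, 1]$, so the Bellman operator takes a minimum over actions. The names `P`, `c`, and `value_iteration`, along with the two-state toy problem, are illustrative assumptions.

```python
def value_iteration(P, c, gamma, tol=1e-10, max_iters=10_000):
    """Classical value iteration for a discounted, cost-minimizing MDP.

    P[s][a] is a list of next-state probabilities, c[s][a] is a cost in [0, 1].
    Applies V(s) <- min_a [ c(s, a) + gamma * sum_s' P(s' | s, a) V(s') ]
    until successive iterates differ by less than `tol` in the sup norm.
    """
    n_states = len(P)
    V = [0.0] * n_states
    for _ in range(max_iters):
        V_new = [
            min(
                c[s][a] + gamma * sum(prob * V[s2] for s2, prob in enumerate(P[s][a]))
                for a in range(len(P[s]))
            )
            for s in range(n_states)
        ]
        gap = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if gap < tol:
            break
    return V


# Hypothetical two-state MDP: action 0 stays put, action 1 switches state.
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
c = [
    [1.0, 0.5],  # state 0: staying costs 1.0, switching costs 0.5
    [0.0, 1.0],  # state 1: staying is free, switching costs 1.0
]
V = value_iteration(P, c, gamma=0.9)
```

In this toy problem the cheapest plan is to move to state 1 and stay there, so the iteration settles at a value of 0.5 for state 0 and 0 for state 1. Because the Bellman operator is a $\gamma$-contraction in the sup norm, the error of classical value iteration shrinks by at least a factor of $\gamma$ per sweep.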
