Value Mirror Descent for Reinforcement Learning
Zhichao Jia, Guanghui Lan

TL;DR
This paper introduces a novel value mirror descent method for reinforcement learning that improves convergence and sample complexity, especially in offline and high-accuracy settings.
Contribution
The paper develops a new value optimization algorithm, VMD, integrating mirror descent into value iteration, with stochastic variants and theoretical guarantees.
Findings
VMD converges linearly in the deterministic setting with known transition kernels.
SVMD, the stochastic variant of VMD, achieves near-optimal sample complexity with variance reduction.
Bounded Bregman divergence enables effective online learning.
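For readers unfamiliar with the terms used in the findings, the block below recalls the textbook definitions of the Bregman divergence and the mirror descent step from convex optimization. The symbols (the distance-generating function \omega, the step size \eta, the gradient g_k, the feasible set X) are assumptions chosen for illustration and are not necessarily the paper's notation.

```latex
% Bregman divergence induced by a differentiable, strictly convex function \omega
D_\omega(x, y) = \omega(x) - \omega(y) - \langle \nabla \omega(y),\, x - y \rangle

% With \omega(x) = \sum_i x_i \log x_i on the probability simplex,
% the Bregman divergence is the Kullback-Leibler divergence
D_\omega(x, y) = \sum_i x_i \log \frac{x_i}{y_i}

% Generic mirror descent step with step size \eta > 0 and (sub)gradient g_k
x_{k+1} = \operatorname*{argmin}_{x \in X} \left\{ \eta \langle g_k, x \rangle + D_\omega(x, x_k) \right\}
```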
Abstract
Value iteration-type methods have been extensively studied for computing a nearly optimal value function in reinforcement learning (RL). Under a generative sampling model, these methods can achieve sharper sample complexity than policy optimization approaches, particularly in their dependence on the discount factor. In practice, they are often employed for offline training or in simulated environments. In this paper, we consider discounted Markov decision processes with state space S, action space A, a discount factor γ ∈ (0, 1), and bounded costs. We introduce a novel value optimization method, termed value mirror descent (VMD), which integrates mirror descent from convex optimization into the classical value iteration framework. In the deterministic setting with known transition kernels, we show that VMD converges linearly. For the stochastic setting with a generative model, we…
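To make the idea of combining mirror descent with value iteration concrete, here is a minimal tabular sketch in Python. It is not the paper's VMD algorithm, whose exact update is not given in this summary. It is a generic scheme that assumes a KL (entropy) mirror descent step on the policy followed by a Bellman backup, with known transition kernels and costs to be minimized. The function names, the step size `eta`, and the two-state toy MDP are all hypothetical.

```python
import numpy as np

def value_iteration(P, c, gamma, iters=500):
    """Classical value iteration for cost minimization (reference solution)."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = c + gamma * (P @ V)   # Q[s, a] = c[s, a] + gamma * sum_s' P[s, a, s'] V[s']
        V = Q.min(axis=1)         # greedy (hard-min) backup
    return V

def md_value_iteration(P, c, gamma, eta=1.0, iters=500):
    """Illustrative sketch (assumption, not the paper's method):
    a KL mirror descent policy step followed by a Bellman backup."""
    n_states, n_actions = c.shape
    log_pi = np.full((n_states, n_actions), -np.log(n_actions))  # uniform policy
    Q = c.copy()                  # Q-values corresponding to V = 0
    V = np.zeros(n_states)
    for _ in range(iters):
        # Mirror descent step with KL divergence: pi_new ∝ pi_old * exp(-eta * Q)
        log_pi = log_pi - eta * Q
        m = log_pi.max(axis=1, keepdims=True)
        log_pi = log_pi - (m + np.log(np.exp(log_pi - m).sum(axis=1, keepdims=True)))
        pi = np.exp(log_pi)
        # Value under the updated policy, then one Bellman backup
        V = (pi * Q).sum(axis=1)
        Q = c + gamma * (P @ V)
    return V, np.exp(log_pi)

# Hypothetical two-state, two-action MDP with deterministic transitions.
gamma = 0.9
P = np.zeros((2, 2, 2))           # P[s, a, s']
P[0, 0, 0] = 1.0                  # state 0, action 0: stay in state 0
P[0, 1, 1] = 1.0                  # state 0, action 1: move to state 1
P[1, 0, 1] = 1.0                  # state 1, action 0: stay in state 1
P[1, 1, 0] = 1.0                  # state 1, action 1: move to state 0
c = np.array([[1.0, 0.5],
              [0.0, 1.0]])        # costs c[s, a] in [0, 1]

V_star = value_iteration(P, c, gamma)
V_md, pi_md = md_value_iteration(P, c, gamma)
print("value iteration:     ", V_star)
print("mirror descent sketch:", V_md)
print("policy:\n", pi_md)
```

On this toy problem the optimal values can be checked by hand: staying in state 1 costs nothing, so its value is 0, and moving from state 0 to state 1 costs 0.5. The sketch keeps a stochastic policy throughout and only approaches the greedy one as the exponentiated updates accumulate, which is the main difference from the hard-min backup in classical value iteration.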
