Beyond State-Wise Mirror Descent: Offline Policy Optimization with Parametric Policies

Xiang Li; Yuheng Zhang; Nan Jiang

arXiv:2602.23811 · cs.LG · May 11, 2026

TL;DR

This paper extends offline reinforcement learning theory to parametric policies in large or continuous action spaces, overcoming previous limitations of state-wise mirror descent methods.

Contribution

It introduces a novel analysis connecting mirror descent to natural policy gradient, enabling theoretical guarantees for parametric policies in complex action spaces.

Findings

01. Unified offline RL and imitation learning framework.

02. Theoretical guarantees for large or continuous action spaces.

03. New insights into policy optimization algorithms.

Abstract

We investigate the theoretical aspects of offline reinforcement learning (RL) under general function approximation. While prior works (e.g., Xie et al., 2021) have established the theoretical foundations of learning a good policy from offline data via pessimism, existing algorithms that are computationally tractable (often in an oracle-efficient sense), such as PSPI, only apply to finite and small action spaces. Moreover, these algorithms rely on state-wise mirror descent and require actors to be implicitly induced from the critic functions, failing to accommodate standalone policy parameterization, which is ubiquitous in practice. In this work, we address these limitations and extend the theoretical guarantees to parameterized policy classes over large or continuous action spaces. When extending mirror descent to parameterized policies, we identify contextual coupling as the core…
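For readers unfamiliar with the term, state-wise mirror descent over a finite action set is commonly written as a multiplicative-weights update applied independently at each state: the new policy is proportional to the old policy times exp(eta * critic value). The sketch below is a generic tabular illustration of that textbook update, not code from the paper. The function name, the step size `eta`, and the array shapes are all assumptions made for the example.

```python
import numpy as np

def statewise_mirror_descent_step(policy, q_values, eta):
    """One KL-regularized mirror descent step, applied separately per state.

    policy:   (num_states, num_actions) array, rows sum to 1, full support assumed
    q_values: (num_states, num_actions) array of critic estimates
    eta:      step size (hypothetical parameter name)
    """
    # Multiplicative-weights update in log space: log pi + eta * Q.
    logits = np.log(policy) + eta * q_values
    # Subtract the per-state max for numerical stability before exponentiating.
    logits -= logits.max(axis=1, keepdims=True)
    new_policy = np.exp(logits)
    # Normalizing requires summing over every action at every state.
    new_policy /= new_policy.sum(axis=1, keepdims=True)
    return new_policy

# Toy usage with made-up critic values (illustrative only).
num_states, num_actions = 3, 4
policy = np.full((num_states, num_actions), 1.0 / num_actions)
rng = np.random.default_rng(0)
q_values = rng.normal(size=(num_states, num_actions))
updated = statewise_mirror_descent_step(policy, q_values, eta=0.5)
```

In this form the actor is fully determined by the accumulated critic values, and the normalization enumerates all actions at each state. This is one way to see why the abstract describes such methods as limited to finite, small action spaces and as inducing the actor implicitly from the critic.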
