Functional Natural Policy Gradients

Aurelien Bibaut; Houssam Zenati; Thibaud Rahier; Nathan Kallus

arXiv:2603.28681·stat.ML·April 6, 2026

TL;DR

This paper introduces a cross-fitted debiasing device for policy learning from offline data. Its regret bound factors into a policy-class complexity term and an environment-dynamics nuisance term, making explicit how one may be traded against the other.

Contribution

It proposes a cross-fitted debiasing device for policy learning from offline data.

Findings

01

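Achieves $\sqrt{N}$ regret even for policy classes with complexity greater than Donsker, provided a product-of-errors nuisance remainder is $O(N^{-1/2})$.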


Abstract

We propose a cross-fitted debiasing device for policy learning from offline data. A key consequence of the resulting learning principle is $\sqrt{N}$ regret even for policy classes with complexity greater than Donsker, provided a product-of-errors nuisance remainder is $O(N^{-1/2})$. The regret bound factors into a plug-in policy error factor governed by policy-class complexity and an environment nuisance factor governed by the complexity of the environment dynamics, making explicit how one may be traded against the other.
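
To make the terms "cross-fitted" and "product-of-errors" concrete, here is a minimal sketch of a standard cross-fitted doubly robust value estimate in a tabular contextual bandit. It is not the paper's device: the setting, the tabular nuisance fits, and every name (`fit_nuisances`, `cross_fitted_dr_value`, `simulate_logged_data`) are assumptions chosen for illustration. Nuisances are fitted on one part of the data and evaluated on the held-out part, and the bias of the resulting score is driven by the product of the propensity error and the outcome-model error.

```python
import numpy as np


def fit_nuisances(x, a, r, n_contexts, n_actions):
    """Tabular nuisance fits (illustrative): propensity e(a|x) and mean reward q(x, a)."""
    counts = np.zeros((n_contexts, n_actions))
    sums = np.zeros((n_contexts, n_actions))
    np.add.at(counts, (x, a), 1.0)
    np.add.at(sums, (x, a), r)
    # Add-one smoothing keeps estimated propensities strictly positive.
    propensity = (counts + 1.0) / (counts.sum(axis=1, keepdims=True) + n_actions)
    q_hat = sums / np.maximum(counts, 1.0)
    return propensity, q_hat


def cross_fitted_dr_value(x, a, r, policy, n_contexts, n_actions, n_folds=2, seed=0):
    """Cross-fitted doubly robust estimate of the value of a deterministic policy.

    `policy[c]` is the action the target policy takes in context `c`.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    folds = rng.permutation(n) % n_folds  # balanced random fold labels
    scores = np.empty(n)
    for k in range(n_folds):
        train, test = folds != k, folds == k
        # Nuisances are fitted only on data outside the evaluation fold.
        e_hat, q_hat = fit_nuisances(x[train], a[train], r[train], n_contexts, n_actions)
        xt, at, rt = x[test], a[test], r[test]
        target_action = policy[xt]
        plug_in = q_hat[xt, target_action]
        # Importance-weighted residual corrects the plug-in term.
        weight = (at == target_action) / e_hat[xt, at]
        scores[test] = plug_in + weight * (rt - q_hat[xt, at])
    return scores.mean()


def simulate_logged_data(n, seed=0):
    """Hypothetical logged bandit data: 3 contexts, 2 actions, Gaussian reward noise."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 3, size=n)
    p_action_1 = np.array([0.3, 0.5, 0.7])[x]  # behavior policy
    a = (rng.random(n) < p_action_1).astype(int)
    mean_reward = np.array([[0.1, 0.5], [0.4, 0.2], [0.3, 0.9]])
    r = mean_reward[x, a] + 0.5 * rng.standard_normal(n)
    return x, a, r


x, a, r = simulate_logged_data(20000, seed=1)
policy = np.array([1, 0, 1])  # target policy; true value is (0.5 + 0.4 + 0.9) / 3 = 0.6
estimate = cross_fitted_dr_value(x, a, r, policy, n_contexts=3, n_actions=2)
print(f"cross-fitted DR value estimate: {estimate:.3f}")
```

In this toy setting both nuisances are simple cell averages, so both errors are small. The sample-splitting pattern is what matters when the nuisances come from flexible learners, because it keeps each score independent of the data used to fit its nuisances.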
