Bi-Level Policy Optimization with Nyström Hypergradients

Arjun Prakash; Naicheng He; Denizalp Goktas; Jacob Makar-Limanov; Amy Greenwald

arXiv:2505.11714·cs.LG·March 19, 2026

TL;DR

This paper introduces BLPO, a bilevel policy optimization algorithm for actor-critic reinforcement learning. BLPO nests the critic's update to account for the actor's dependence on the critic, updates the actor with Nyström hypergradients, and comes with a convergence guarantee.

Contribution

The paper proposes BLPO, a bilevel policy optimization method that employs Nyström hypergradients for stable and efficient learning in actor-critic algorithms, with theoretical convergence guarantees.

Findings

01

BLPO converges to a local strong Stackelberg equilibrium in polynomial time.

02

BLPO performs comparably to or better than PPO on various control tasks.

03

The Nyström method stabilizes hypergradient computation in bilevel RL.

Abstract

The dependency of the actor on the critic in actor-critic (AC) reinforcement learning means that AC can be characterized as a bilevel optimization (BLO) problem, also called a Stackelberg game. This characterization motivates two modifications to vanilla AC algorithms. First, the critic's update should be nested to learn a best response to the actor's policy. Second, the actor should update according to a hypergradient that takes changes in the critic's behavior into account. Computing this hypergradient involves finding an inverse Hessian vector product, a process that can be numerically unstable. We thus propose a new algorithm, Bilevel Policy Optimization with Nyström Hypergradients (BLPO), which uses nesting to account for the nested structure of BLO, and leverages the Nyström method to compute the hypergradient. Theoretically, we prove BLPO converges to (a point that satisfies…
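To make the inverse Hessian vector product concrete, here is a minimal NumPy sketch of a Nyström-style hypergradient on a toy quadratic bilevel problem. It is an illustration under stated assumptions, not the paper's implementation: the quadratic objectives, the damping term `rho`, the uniform column sampling, and the helper name `nystrom_ihvp` are all hypothetical choices. The variable `x` stands in for the actor's parameters and `y` for the critic's, and the lower-level curvature is built with rank exactly `k` so that the approximation can be checked against a direct solve.

```python
import numpy as np


def nystrom_ihvp(H, v, idx, rho):
    """Approximate (H + rho * I)^{-1} v using only the columns H[:, idx].

    Hypothetical helper for illustration. In a large model the columns
    would come from Hessian-vector products, not from an explicit matrix.
    """
    C = H[:, idx]                      # (m, k) sampled columns of H
    W = C[idx, :]                      # (k, k) block H[idx][:, idx]
    # Nystrom approximation: H ~ C W^{-1} C^T.
    # Woodbury identity on rho * I + C W^{-1} C^T, so that only a
    # k x k system is solved instead of an m x m one.
    M = W + (C.T @ C) / rho
    return v / rho - C @ np.linalg.solve(M, C.T @ v) / rho**2


# Toy problem (all quantities below are assumptions made for this sketch).
rng = np.random.default_rng(0)
m, n, k, rho = 50, 3, 5, 0.1

L = rng.standard_normal((m, k))
H = L @ L.T                            # rank-k curvature of the lower level
A = H + rho * np.eye(m)                # damped lower-level Hessian
B = rng.standard_normal((m, n))
t = rng.standard_normal(m)
x = rng.standard_normal(n)             # upper-level ("actor") variable

# Lower level: g(x, y) = 0.5 * y^T A y - y^T B x, so y*(x) = A^{-1} B x.
y_star = np.linalg.solve(A, B @ x)

# Upper level: f(y) = 0.5 * ||y - t||^2, which depends on x only via y*(x).
grad_y_f = y_star - t

# Implicit differentiation gives dy*/dx = A^{-1} B, so the hypergradient
# is B^T A^{-1} grad_y_f. The A^{-1} grad_y_f factor is the inverse
# Hessian vector product.
idx = rng.choice(m, size=k, replace=False)
hypergrad_nystrom = B.T @ nystrom_ihvp(H, grad_y_f, idx, rho)
hypergrad_exact = B.T @ np.linalg.solve(A, grad_y_f)

print("max abs difference:", np.max(np.abs(hypergrad_nystrom - hypergrad_exact)))
```

Because the toy curvature has rank exactly `k`, the two hypergradients agree up to floating-point error. For a general Hessian the Nyström result is only an approximation whose quality depends on how many columns are sampled and which ones.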

Taxonomy

Topics: Adaptive Dynamic Programming Control · Reinforcement Learning in Robotics · Optimization and Variational Analysis

Methods: Entropy Regularization · Proximal Policy Optimization