Bi-Level Policy Optimization with Nyström Hypergradients
Arjun Prakash, Naicheng He, Denizalp Goktas, Jacob Makar-Limanov, Amy Greenwald

TL;DR
This paper introduces BLPO, a bilevel policy optimization algorithm that uses Nyström hypergradients to account for the nested dependency between actor and critic in actor-critic reinforcement learning, and comes with a convergence guarantee.
Contribution
The paper proposes BLPO, a bilevel policy optimization method that employs Nyström hypergradients for stable and efficient learning in actor-critic algorithms, with theoretical convergence guarantees.
Findings
BLPO converges to a local strong Stackelberg equilibrium in polynomial time.
BLPO performs comparably or better than PPO on various control tasks.
The Nyström method stabilizes hypergradient computation in bilevel RL.
Abstract
The dependency of the actor on the critic in actor-critic (AC) reinforcement learning means that AC can be characterized as a bilevel optimization (BLO) problem, also called a Stackelberg game. This characterization motivates two modifications to vanilla AC algorithms. First, the critic's update should be nested to learn a best response to the actor's policy. Second, the actor should update according to a hypergradient that takes changes in the critic's behavior into account. Computing this hypergradient involves finding an inverse Hessian vector product, a process that can be numerically unstable. We thus propose a new algorithm, Bilevel Policy Optimization with Nyström Hypergradients (BLPO), which uses nesting to account for the nested structure of BLO, and leverages the Nyström method to compute the hypergradient. Theoretically, we prove BLPO converges to (a point that satisfies…
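The inverse Hessian vector product mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `nystrom_ihvp`, the damping parameter `rho`, and the use of a dense toy Hessian are all assumptions made for illustration. The sketch builds a low-rank Nyström approximation of a symmetric positive semidefinite Hessian from a subset of its columns and applies the Woodbury identity, so that only a small k-by-k system is solved.

```python
import numpy as np


def nystrom_ihvp(hessian, v, col_idx, rho):
    """Approximate (hessian + rho * I)^{-1} v with a Nystrom low-rank model.

    Hypothetical helper for illustration only. `hessian` is a dense symmetric
    PSD matrix here; in a large model its selected columns would typically be
    obtained with Hessian-vector products instead of forming the matrix.
    """
    H_K = hessian[:, col_idx]      # n x k block of sampled columns
    H_KK = H_K[col_idx, :]         # k x k block at the sampled rows/columns
    # Nystrom model: hessian ~= H_K @ inv(H_KK) @ H_K.T
    # Woodbury identity applied to (rho * I + H_K @ inv(H_KK) @ H_K.T)^{-1}:
    #   v / rho - H_K @ inv(H_KK + H_K.T @ H_K / rho) @ H_K.T @ v / rho**2
    small = H_KK + (H_K.T @ H_K) / rho
    correction = H_K @ np.linalg.solve(small, H_K.T @ v)
    return v / rho - correction / rho**2


# Toy problem (assumed data, not from the paper).
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
H = A @ A.T + 0.1 * np.eye(6)      # symmetric positive definite
v = rng.standard_normal(6)
rho = 0.5

exact = np.linalg.solve(H + rho * np.eye(6), v)
approx_full = nystrom_ihvp(H, v, np.arange(6), rho)          # all columns
approx_sub = nystrom_ihvp(H, v, np.array([0, 2, 4]), rho)    # 3 of 6 columns
```

With all columns selected the Nyström model reproduces the Hessian, so `approx_full` matches the damped solve `exact` up to floating-point error. With fewer columns the result is only an approximation, traded for solving a smaller linear system.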
Taxonomy
Topics: Adaptive Dynamic Programming Control · Reinforcement Learning in Robotics · Optimization and Variational Analysis
Methods: Entropy Regularization · Proximal Policy Optimization
