Sample-Efficient Hypergradient Estimation for Decentralized Bi-Level Reinforcement Learning
Mikoto Kudo, Takumi Tanabe, Akifumi Wachi, and Youhei Akimoto

TL;DR
This paper introduces a sample-efficient method for hypergradient estimation in decentralized bi-level reinforcement learning, enabling effective optimization when the leader cannot intervene directly in the follower's process.
Contribution
The paper presents a novel hypergradient formulation based on the Boltzmann covariance trick, which allows the hypergradient to be estimated efficiently from samples in decentralized RL settings, even when the leader's decision space is high-dimensional.
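As a rough illustration of the kind of identity a covariance trick builds on, the sketch below uses a generic fact about Boltzmann (softmax) distributions: if p_theta(x) is proportional to exp(f_theta(x)), then the gradient of E[g(x)] with respect to theta equals Cov(g(x), grad_theta f_theta(x)), which can be estimated from samples alone. This is not the paper's estimator. The linear logits, the feature matrix, the function names, and the toy sizes are all assumptions made for illustration only.

```python
import numpy as np


def boltzmann(logits):
    """Boltzmann (softmax) distribution: p(x) proportional to exp(logits[x])."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum()


def expectation(theta, features, g):
    """E_{x ~ p_theta}[g(x)], assuming linear logits f_theta(x) = features[x] @ theta."""
    return boltzmann(features @ theta) @ g


def exact_covariance_gradient(theta, features, g):
    """Exact Cov(g(x), grad_theta f_theta(x)) under p_theta.

    For linear logits (an assumption of this sketch), grad_theta f_theta(x)
    is simply the feature vector features[x].
    """
    p = boltzmann(features @ theta)
    mean_g = p @ g
    mean_feat = p @ features
    return (p * (g - mean_g)) @ (features - mean_feat)


def sampled_covariance_gradient(theta, features, g, n_samples, rng):
    """Monte Carlo estimate of the same covariance from samples x ~ p_theta."""
    p = boltzmann(features @ theta)
    idx = rng.choice(len(p), size=n_samples, p=p)
    g_s = g[idx]
    feat_s = features[idx]
    centered_g = g_s - g_s.mean()
    centered_feat = feat_s - feat_s.mean(axis=0)
    return (centered_g[:, None] * centered_feat).mean(axis=0)


def finite_difference_gradient(theta, features, g, eps=1e-6):
    """Central finite differences of E[g(x)], used only as a sanity check."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        plus = expectation(theta + step, features, g)
        minus = expectation(theta - step, features, g)
        grad[i] = (plus - minus) / (2.0 * eps)
    return grad


# Hypothetical toy problem: 6 outcomes, 3 parameters.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 3))
g = rng.normal(size=6)
theta = rng.normal(size=3)

print("exact covariance :", exact_covariance_gradient(theta, features, g))
print("finite difference:", finite_difference_gradient(theta, features, g))
print("sampled estimate :", sampled_covariance_gradient(theta, features, g, 100_000, rng))
```

The appeal of this form is that the sampled estimate needs only draws from the Boltzmann distribution and per-sample quantities, with no differentiation through the sampling process. How the paper applies the trick to the follower's policy and the leader's objective is not covered by this sketch.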
Findings
Effective hypergradient estimation from interaction samples.
First method for hypergradient-based optimization in decentralized 2-player Markov games.
Demonstrated on tasks with both discrete and continuous state spaces.
Abstract
Many strategic decision-making problems, such as environment design for warehouse robots, can be naturally formulated as bi-level reinforcement learning (RL), where a leader agent optimizes its objective while a follower solves a Markov decision process (MDP) conditioned on the leader's decisions. In many situations, a fundamental challenge arises when the leader cannot intervene in the follower's optimization process; it can only observe the optimization outcome. We address this decentralized setting by deriving the hypergradient of the leader's objective, i.e., the gradient of the leader's strategy that accounts for changes in the follower's optimal policy. Unlike prior hypergradient-based methods that require extensive data for repeated state visits or rely on gradient estimators whose complexity can increase substantially with the high-dimensional leader's decision space, we…
