A Payoff-Based Policy Gradient Method in Stochastic Games with Long-Run Average Payoffs
Junyue Zhang, Yifen Mu

TL;DR
This paper introduces a payoff-based policy gradient algorithm for stochastic games with long-run average payoffs and proves that it converges to a Nash equilibrium with probability one for a wide class of stable games.
Contribution
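It develops an equivalent formulation of the individual payoff gradients via bounded advantage functions, shows that these gradients are Lipschitz continuous in the policy profile, and constructs a payoff-based (bandit) learning algorithm with proven convergence for such games.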
Findings
Gradient dominance property established for value functions
Algorithm converges to Nash equilibrium with probability one
Applicable to a wide class of stable stochastic games
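For orientation, the two central objects in these findings can be written in a generic textbook form. The notation below (J_i, r_i, the constant C, the comparison policy) is an assumption chosen for illustration and is not taken from the paper, whose exact statement and constants may differ.

```latex
% Long-run average payoff of player i under policy profile \pi
% (illustrative notation; s_t is the state and a_t the joint action at time t)
J_i(\pi) \;=\; \liminf_{T \to \infty} \frac{1}{T}\,
  \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} r_i(s_t, a_t)\right]

% Gradient dominance, in its usual form: for every unilateral deviation \pi_i',
% the gain from deviating is bounded by a first-order term in player i's own gradient
J_i(\pi_i', \pi_{-i}) - J_i(\pi_i, \pi_{-i})
  \;\le\; C \,\max_{\bar{\pi}_i}\,
  \big\langle \bar{\pi}_i - \pi_i,\; \nabla_{\pi_i} J_i(\pi) \big\rangle
```

Under an inequality of this kind, a policy profile at which no player can improve to first order is also one at which no unilateral deviation is profitable, which is why gradient dominance links stationary points of gradient dynamics to Nash equilibria.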
Abstract
Despite their significant potential for applications, stochastic games with long-run average payoffs have received limited scholarly attention, particularly regarding the development of learning algorithms, owing to the difficulty of their mathematical analysis. In this paper, we study stochastic games with long-run average payoffs and present an equivalent formulation of the individual payoff gradients by defining advantage functions, which we prove to be bounded. This result allows us to show that the individual payoff gradient function is Lipschitz continuous with respect to the policy profile and that the value function of the games exhibits the gradient dominance property. Leveraging these insights, we devise a payoff-based gradient estimation approach and integrate it with the Regularized Robbins-Monro method from stochastic approximation theory to construct…
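To make "payoff-based gradient estimation" concrete, here is a minimal sketch of the general idea: each player observes only its own realized payoff and updates its policy with a score-function gradient estimate and diminishing step sizes. It is not the paper's algorithm. The single-state two-player game, the softmax parameterization, the running-average baseline, the step-size exponent 0.6, and the function name run_payoff_based_pg are all assumptions made for illustration, and the regularization used in the Regularized Robbins-Monro method is omitted.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def run_payoff_based_pg(payoffs, steps=5000, seed=0):
    """Hypothetical sketch: payoff-based policy gradient in a single-state game.

    payoffs[i][a0, a1] is player i's payoff when player 0 plays a0 and
    player 1 plays a1. Each player sees only its own realized payoff.
    Returns the two players' final mixed strategies.
    """
    rng = np.random.default_rng(seed)
    n0, n1 = payoffs[0].shape
    theta = [np.zeros(n0), np.zeros(n1)]  # softmax logits (assumed parameterization)
    baseline = [0.0, 0.0]                 # running average of each player's payoffs

    for n in range(1, steps + 1):
        # Diminishing step size: the sum diverges, the sum of squares converges.
        step = 1.0 / n ** 0.6
        probs = [softmax(t) for t in theta]
        actions = [rng.choice(len(p), p=p) for p in probs]

        for i in range(2):
            r = payoffs[i][actions[0], actions[1]]
            # Gradient of log softmax at the sampled action: e_a - probs.
            score = -probs[i]
            score[actions[i]] += 1.0
            # Score-function estimate built from the realized payoff only.
            theta[i] = theta[i] + step * (r - baseline[i]) * score
            # Update the baseline after use, so it depends on past play only.
            baseline[i] += (r - baseline[i]) / n

    return [softmax(t) for t in theta]


if __name__ == "__main__":
    # Prisoner's dilemma payoffs (rows: player 0's action, columns: player 1's).
    game = [np.array([[3.0, 0.0], [5.0, 1.0]]),
            np.array([[3.0, 5.0], [0.0, 1.0]])]
    print(run_payoff_based_pg(game))
```

The sketch keeps the two ingredients named in the abstract, a gradient estimate built from payoffs alone and a Robbins-Monro step-size schedule, while dropping state transitions entirely. Handling multiple states under the long-run average criterion is where the paper's advantage-function formulation would come in.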
Taxonomy
Topics: Economic Policies and Impacts · Risk and Portfolio Optimization · Optimization and Search Problems
