Thompson Sampling Algorithm for Stochastic Games
Asaf Cohen, Ruolan He, Yuqiong Wang

TL;DR
This paper introduces a Thompson sampling algorithm for stochastic differential games with multiple players, providing regret bounds and convergence to Nash equilibrium in a complex dynamic setting.
Contribution
It develops a novel Thompson sampling approach with dynamic episodes for multi-player stochastic games, achieving regret bounds independent of the number of players.
Findings
Bayesian regret is bounded by O(√T log T) for each player.
Average regret per unit time approaches zero as T increases.
The algorithm converges to a Nash equilibrium in the game.
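One way to see why the second finding follows from the first: write $R(T)$ for a player's Bayesian regret up to time $T$ and $C$ for a constant that does not depend on $T$ (both symbols are notation assumed here, not taken from the paper), and read the bound as $\sqrt{T}\,\log T$.

```latex
\frac{R(T)}{T} \;\le\; \frac{C\,\sqrt{T}\,\log T}{T}
  \;=\; \frac{C\,\log T}{\sqrt{T}} \;\xrightarrow[T\to\infty]{}\; 0 .
```

The same conclusion holds if the bound is instead read as $\sqrt{T \log T}$, since $\sqrt{\log T / T}$ also tends to zero.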
Abstract
We study a stochastic differential game with $N$ competitive players in a linear-quadratic framework with ergodic cost, where multidimensional diffusion processes govern the state dynamics with an unknown common drift (matrix) $A$. Assuming a Gaussian prior on the drift, we use filtering techniques to update its posterior estimates. Based on these estimates, we propose a Thompson-sampling-based algorithm with dynamic episode lengths to approximate strategies. We show that the Bayesian regret for each player has an error bound of order O(√T log T), where T is the time horizon, independent of the number of players. This implies that average regret per unit time goes to zero. Finally, we prove that the algorithm results in a Nash equilibrium.
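The pipeline the abstract describes (Gaussian prior on the drift, posterior update by filtering, a fresh posterior sample per episode) can be illustrated with a rough single-player, scalar toy. This is a sketch under stated assumptions, not the paper's algorithm: the dynamics dX = (aX + u) dt + dW with unit noise, the cost weights q = r = 1, the rule of starting a new episode when the posterior precision has doubled, and every name below (`feedback_gain`, `run_thompson`) are hypothetical choices made for illustration.

```python
import numpy as np


def feedback_gain(a_hat, q=1.0, r=1.0):
    """Gain g of the scalar ergodic LQ feedback u = -g * x for drift a_hat.

    Comes from the scalar Riccati equation 2*a*k - k**2 / r + q = 0, g = k / r.
    """
    return a_hat + np.sqrt(a_hat**2 + q / r)


def run_thompson(a_true=0.5, m0=0.0, v0=1.0, horizon=50.0, dt=1e-3, seed=0):
    """Toy Thompson sampling for dX = (a X + u) dt + dW with unknown a.

    Assumed setup: prior a ~ N(m0, v0), Euler discretization with step dt.
    Returns (posterior mean, posterior variance, number of episodes).
    """
    rng = np.random.default_rng(seed)
    # Gaussian posterior stored as N(score / precision, 1 / precision).
    precision, score = 1.0 / v0, m0 / v0
    x = 1.0
    episodes = 0
    precision_at_start = None
    a_sample = m0
    for _ in range(int(horizon / dt)):
        # Assumed episode rule: resample once the precision has doubled.
        if precision_at_start is None or precision >= 2.0 * precision_at_start:
            a_sample = rng.normal(score / precision, np.sqrt(1.0 / precision))
            precision_at_start = precision
            episodes += 1
        # Act as if the sampled drift were the true one.
        u = -feedback_gain(a_sample) * x
        dx = (a_true * x + u) * dt + np.sqrt(dt) * rng.normal()
        # Conjugate update: (dx - u*dt) ~ N(a * x * dt, dt) given a.
        score += x * (dx - u * dt)
        precision += x * x * dt
        x += dx
    return score / precision, 1.0 / precision, episodes


if __name__ == "__main__":
    mean, var, n_episodes = run_thompson()
    print(f"posterior mean={mean:.3f}, variance={var:.4f}, episodes={n_episodes}")
```

Because the precision only grows as the state is observed, episodes get longer over time in this toy, which is one simple way to obtain dynamic episode lengths; the rule used in the paper may differ.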
Peer Reviews
Decision · Submitted to ICLR 2026
* The paper proves a Bayesian regret bound for ergodic $N$-player games, matching the best-known orders for LQ control while not depending on the number of players $N$.
* It proposes a Thompson sampling algorithm to handle the unknown-$A$ setting, proving that the TS profile is a Nash equilibrium under additional stability conditions.
* The model is based on the previous work of Bardi & Priuli (2014a). This paper handles the setting where the matrix $A$ is unknown but does not explain why this setting is meaningful in practice.
* This paper focuses on the theoretical side. The authors use a full section (Section 2) to introduce existing results but do not state their technical contribution over previous work.
I am not sufficiently knowledgeable in stochastic differential games to provide an expert opinion but, as far as I could tell, the authors' analysis is sound, and the positioning of their contributions in the surrounding literature is fair. The contribution itself seems in line with what could be expected from a good paper in the field.
My main concern with this paper is its thematic alignment with ICLR. Even though I could easily see this paper published in a top-tier control venue (IEEE CDC, TAC or SIOPT SICON), the fit with ICLR is very slim. This would be less of an issue if the field of (stochastic) differential games were more accessible from a technical standpoint but, as it currently stands, the paper's technical content and contributions would only be accessible to an infinitesimally thin slice of ICLR's generalist aud
1. This paper proposes a novel approach for solving multi-player SDGs using Thompson sampling, extending the approach of prior work.
2. The framework also relaxes assumptions in previous work on the independence of the players.
3. The results on Nash equilibrium are very interesting and potentially impactful to the field.
I am not in this field, so my feedback might be limited; please correct me if I am wrong. That said, I have a few concerns.
1. The scope of this work is quite limited. The authors assume there is no coupling between the players through the dynamics, only through the costs. Although the authors motivate the scenarios in the intro, the applicability of such a framework remains elusive and quite restrictive.
2. The tools the authors use seem to be borrowed from prior works. I.e.,
Taxonomy
Topics: Advanced Bandit Algorithms Research · Game Theory and Applications · Reinforcement Learning in Robotics
