On-line Learning in Tree MDPs by Treating Policies as Bandit Arms

Anvay Shah; Ramsundar Anandanarayanan; Sharayu Moharir; Shivaram Kalyanakrishnan

arXiv:2605.04979·cs.AI·May 7, 2026


TL;DR

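This paper shows that the bandit algorithms LUCB and UCB can be applied to on-line learning in Tree MDPs by treating each policy as an arm. Confidence bounds built from data shared across policies keep memory and per-step computation polynomial, even though the number of policies is exponential in the number of states.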

Contribution

The paper designs confidence bounds based on data shared among policies, so that bandit algorithms run over an exponentially large set of policies can still be implemented with polynomial memory and per-step computation.

Findings

01 Algorithms outperform alternatives in hidden-information games.

02 Instance-dependent bounds depend on terminal state gaps.

03 Polynomial memory and computation achieved for policy-based bandit algorithms.

Abstract

A Tree Markov Decision Problem (T-MDP) is a finite-horizon MDP with a starting state $s_{1}$, in which every state is reachable from $s_{1}$ through exactly one state-action trajectory. T-MDPs arise naturally as abstractions of decision making in sequential games with perfect recall, against stationary opponents. We consider the problem of on-line learning in T-MDPs, both in the PAC and the regret-minimisation regimes. We show that well-known bandit algorithms (LUCB and UCB) can be applied on T-MDPs by treating each policy as an arm. The apparent technical challenge in this approach is that the number of policies is exponential in the number of states. Our main innovation is in the design of confidence bounds based on data shared by the policies, so that the bandit algorithms can yet be implemented with polynomial memory and per-step computation. We obtain…
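To make the "exponential in the number of states" remark concrete, the following is a minimal sketch that counts states and deterministic policies in a toy T-MDP. It is an illustration only, not the paper's algorithm: the complete-tree shape, the parameters `depth`, `n_actions`, and `n_outcomes`, and both function names are assumptions introduced here. A policy is taken to be any assignment of one action to every non-terminal state.

```python
def count_states(depth: int, n_actions: int, n_outcomes: int) -> tuple[int, int]:
    """Count (non-terminal, terminal) states in a hypothetical complete tree.

    Assumption: every non-terminal state offers `n_actions` actions, and each
    action leads to one of `n_outcomes` distinct child states, so each state
    has n_actions * n_outcomes children and a unique path from the root.
    """
    branching = n_actions * n_outcomes
    non_terminal = sum(branching ** d for d in range(depth))
    terminal = branching ** depth
    return non_terminal, terminal


def count_policies(depth: int, n_actions: int, n_outcomes: int) -> int:
    """Count deterministic policies: one action choice per non-terminal state."""
    non_terminal, _ = count_states(depth, n_actions, n_outcomes)
    return n_actions ** non_terminal


if __name__ == "__main__":
    for depth in (1, 2, 3):
        non_terminal, terminal = count_states(depth, 2, 2)
        policies = count_policies(depth, 2, 2)
        print(f"depth={depth}: {non_terminal + terminal} states, {policies} policies")
```

With two actions and two outcomes per action, the state count grows geometrically with depth while the policy count grows as two raised to the number of non-terminal states. Keeping a separate statistic per policy is therefore impractical in this toy setting, whereas statistics indexed by state-action pairs stay polynomial in size.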
