Stackelberg Batch Policy Learning
Wenzhuo Zhou, Annie Qu

TL;DR
This paper introduces StackelbergLearner, a novel game-theoretic algorithm for batch reinforcement learning that models hierarchical decision-making and provides theoretical guarantees without requiring data coverage.
Contribution
It proposes a new Stackelberg game-based learning algorithm with convergence guarantees and instance-dependent regret bounds under minimal assumptions.
Findings
Consistently outperforms state-of-the-art batch RL methods on benchmark tasks.
Provides regret bounds without requiring data coverage or Bellman closedness.
Introduces a novel leader-follower dynamic with convergence guarantees.
Abstract
Batch reinforcement learning (RL) defines the task of learning from a fixed batch of data lacking exhaustive exploration. Worst-case optimality algorithms, which calibrate a value-function model class from logged experience and perform some type of pessimistic evaluation under the learned model, have emerged as a promising paradigm for batch RL. However, contemporary works on this stream have commonly overlooked the hierarchical decision-making structure hidden in the optimization landscape. In this paper, we adopt a game-theoretical viewpoint and model the policy learning diagram as a two-player general-sum game with a leader-follower structure. We propose a novel stochastic gradient-based learning algorithm: StackelbergLearner, in which the leader player updates according to the total derivative of its objective instead of the usual individual gradient, and the follower player makes…
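The difference between a leader that follows the total derivative of its objective and one that follows its individual gradient can be illustrated on a toy scalar game. This is a minimal sketch and not the paper's algorithm: the quadratic objectives, step sizes, deterministic updates, and the names `run`, `lr_leader`, and `lr_follower` are all assumptions chosen for illustration, and the follower here simply takes a plain gradient step on its own objective. The leader minimizes f(x, y) = (x - 1)^2 + y^2 and the follower minimizes g(x, y) = 0.5 (y - x)^2, so the follower's best response is y*(x) = x and the Stackelberg solution for the leader is x = 0.5.

```python
def run(total_derivative, steps=200, lr_leader=0.05, lr_follower=0.5):
    """Simultaneous gradient play on a toy leader-follower game (illustrative only).

    Leader objective:   f(x, y) = (x - 1)**2 + y**2
    Follower objective: g(x, y) = 0.5 * (y - x)**2
    """
    x, y = 0.0, 0.0
    for _ in range(steps):
        df_dx = 2.0 * (x - 1.0)   # partial f / partial x
        df_dy = 2.0 * y           # partial f / partial y
        dg_dy = y - x             # partial g / partial y
        if total_derivative:
            # Implicit differentiation of the follower's optimality condition
            # dg/dy = 0 gives dy*/dx = -(d2g/dy2)^(-1) * d2g/dydx.
            d2g_dy2 = 1.0
            d2g_dydx = -1.0
            leader_grad = df_dx - d2g_dydx / d2g_dy2 * df_dy
        else:
            # Individual gradient: ignores how the follower reacts to x.
            leader_grad = df_dx
        # Both players update from the same iterate (simultaneous play).
        x, y = x - lr_leader * leader_grad, y - lr_follower * dg_dy
    return x, y


x_stackelberg, y_stackelberg = run(total_derivative=True)
x_individual, y_individual = run(total_derivative=False)
```

With the total derivative the leader accounts for the follower's reaction and the iterates settle near x = y = 0.5, the minimizer of f(x, y*(x)). With the individual gradient the leader treats y as fixed and the iterates settle near x = y = 1, where the leader's cost along the follower's response curve is higher.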
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Sports Analytics and Performance
