Stackelberg Batch Policy Learning

Wenzhuo Zhou; Annie Qu

arXiv:2309.16188 · stat.ML · October 3, 2023

TL;DR

This paper introduces StackelbergLearner, a novel game-theoretic algorithm for batch reinforcement learning that models hierarchical decision-making and provides theoretical guarantees without requiring data coverage.

Contribution

It proposes a new Stackelberg game-based learning algorithm with convergence guarantees and instance-dependent regret bounds under minimal assumptions.

Findings

1. Consistently outperforms state-of-the-art batch RL methods in benchmarks.

2. Provides regret bounds without requiring data coverage or Bellman closedness.

3. Introduces a novel leader-follower dynamic with convergence guarantees.

Abstract

Batch reinforcement learning (RL) defines the task of learning from a fixed batch of data lacking exhaustive exploration. Worst-case optimality algorithms, which calibrate a value-function model class from logged experience and perform some type of pessimistic evaluation under the learned model, have emerged as a promising paradigm for batch RL. However, contemporary works on this stream have commonly overlooked the hierarchical decision-making structure hidden in the optimization landscape. In this paper, we adopt a game-theoretical viewpoint and model the policy learning diagram as a two-player general-sum game with a leader-follower structure. We propose a novel stochastic gradient-based learning algorithm: StackelbergLearner, in which the leader player updates according to the total derivative of its objective instead of the usual individual gradient, and the follower player makes…
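
The difference between a total-derivative update and an individual-gradient update for the leader can be seen on a toy problem. The sketch below is not the paper's StackelbergLearner. It is a minimal illustration on two hypothetical scalar quadratic objectives, f for the leader and g for the follower, with a plain gradient step assumed for the follower and the follower's response slope obtained from the implicit function theorem. All function names, objectives, and step sizes are assumptions made for this example.

```python
# Toy leader-follower game (illustrative assumptions, not the paper's setup).
# Leader minimizes   f(x, y) = (x - 1)^2 + y^2   over x.
# Follower minimizes g(x, y) = 0.5 * (y - x)^2   over y, so its best response is y = x.

G_YY = 1.0   # second derivative of g with respect to y
G_YX = -1.0  # mixed derivative of g with respect to y, then x


def leader_partial_x(x, y):
    """Partial derivative of f with respect to x, holding y fixed."""
    return 2.0 * (x - 1.0)


def leader_partial_y(x, y):
    """Partial derivative of f with respect to y."""
    return 2.0 * y


def follower_partial_y(x, y):
    """Partial derivative of g with respect to y."""
    return y - x


def leader_total_derivative(x, y):
    """Total derivative of f in x, accounting for the follower's reaction.

    The implicit function theorem applied to dg/dy = 0 gives the slope of
    the follower's best response: dy/dx = -G_YX / G_YY.
    """
    dy_dx = -G_YX / G_YY
    return leader_partial_x(x, y) + leader_partial_y(x, y) * dy_dx


def run(use_total_derivative, steps=2000, lr_leader=0.01, lr_follower=0.1):
    """Simultaneous gradient steps; the follower uses a larger step size."""
    x, y = 0.0, 0.0
    for _ in range(steps):
        if use_total_derivative:
            grad_x = leader_total_derivative(x, y)
        else:
            grad_x = leader_partial_x(x, y)
        grad_y = follower_partial_y(x, y)
        x -= lr_leader * grad_x
        y -= lr_follower * grad_y
    return x, y


stackelberg_point = run(use_total_derivative=True)
individual_point = run(use_total_derivative=False)
print("total derivative:   ", stackelberg_point)
print("individual gradient:", individual_point)
```

In this toy game the total-derivative run settles near (0.5, 0.5), where the leader's cost is 0.5, while the individual-gradient run settles near (1, 1), where the leader's cost is 1. The leader does better by anticipating the follower's reaction, which is the intuition behind a leader-follower structure.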

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Sports Analytics and Performance