Tighter Regret Bounds for Contextual Action-Set Reinforcement Learning

Zijun Chen; Zihan Zhang

arXiv:2605.15692·cs.LG·May 18, 2026


TL;DR

This paper introduces improved regret bounds for episodic reinforcement learning with context-dependent action sets, extending the MVP algorithm and providing both minimax and gap-dependent guarantees.

Contribution

The paper extends the MVP algorithm to context-dependent action sets and derives regret bounds for adversarial contexts (minimax) and stochastic contexts, together with gap-dependent guarantees.

Findings

01

Established a minimax regret bound of $O(\sqrt{S A H^{3} K \log L})$ for adversarial contexts, where $L$ is the number of possible contexts.

02

Derived a regret bound of $O(\sqrt{S A H^{3} K})$ for stochastic contexts.

03

Provided a sample complexity bound of $O(S A H^{3} / \epsilon^{2})$ for fixed context distributions.
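The step from a stochastic-context regret bound to a sample complexity bound can be illustrated with a standard online-to-batch argument. This is a sketch, not the paper's proof: the constant $C$, the uniformly random episode index $\hat{k}$, the square-root form of the regret bound, and the omission of logarithmic factors are all assumptions made for illustration.

```latex
% Assumed regret bound for stochastic contexts (C is a hypothetical constant):
\mathbb{E}\Big[\sum_{k=1}^{K} \big(V^{*, M^{k}} - V^{\pi^{k}, M^{k}}\big)\Big]
  \le C \sqrt{S A H^{3} K}.

% Draw \hat{k} uniformly from \{1, \dots, K\} and output \pi^{\hat{k}}.
% Its expected suboptimality is the average per-episode regret:
\mathbb{E}\big[V^{*, M} - V^{\pi^{\hat{k}}, M}\big]
  \le \frac{C \sqrt{S A H^{3} K}}{K}
  = C \sqrt{\frac{S A H^{3}}{K}}.

% Requiring the right-hand side to be at most \epsilon gives
C \sqrt{\frac{S A H^{3}}{K}} \le \epsilon
  \iff K \ge \frac{C^{2} S A H^{3}}{\epsilon^{2}},
% which matches the O(S A H^{3} / \epsilon^{2}) scaling.
```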

Abstract

We study episodic reinforcement learning with fixed reward and transition functions, but with episode-dependent admissible action sets that are observed at the start of each episode. Performance is measured by cumulative regret against the episode-wise optimal value, $\sum_{k = 1}^{K} [V^{*, M^{k}} - V^{\pi^{k}, M^{k}}]$, where $M^{k}$ represents the action context in the $k$-th episode. We show that the MVP algorithm naturally extends to this framework and enjoys strong theoretical guarantees. In particular, we establish a minimax regret bound of $O(\sqrt{S A H^{3} K \log L})$ for adversarial contexts, where $L$ denotes the number of possible contexts. This result implies a regret bound of $O(\sqrt{S A H^{3} K})$ for stochastic contexts. We further translate the stochastic regret guarantee into a sample complexity bound of $O(S A H^{3} / \epsilon^{2})$ for a fixed context…
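The regret definition in the abstract can be made concrete with a small tabular sketch. The code below is not the MVP algorithm; it only computes the episode-wise optimal value under a restricted action set and the resulting cumulative regret of given policies. Everything in it is an illustrative assumption: the function names, the time-homogeneous rewards and transitions, the representation of a context as a boolean mask over (state, action) pairs, and the toy two-state MDP.

```python
import numpy as np

def optimal_value(P, R, mask, H):
    """Backward induction restricted to admissible actions.

    P: (S, A, S) transition probabilities, R: (S, A) rewards,
    mask: (S, A) boolean, True where the action is admissible.
    Assumes every state has at least one admissible action.
    """
    V = np.zeros(R.shape[0])
    for _ in range(H):
        Q = R + P @ V                      # (S, A) action values
        Q = np.where(mask, Q, -np.inf)     # forbid inadmissible actions
        V = Q.max(axis=1)
    return V

def policy_value(P, R, policy, H):
    """Value of a deterministic policy; policy[h, s] is the action at step h."""
    S = R.shape[0]
    idx = np.arange(S)
    V = np.zeros(S)
    for h in reversed(range(H)):
        a = policy[h]
        V = R[idx, a] + P[idx, a] @ V
    return V

def cumulative_regret(P, R, masks, policies, H, s0=0):
    """Sum over episodes of V^{*, M^k}(s0) - V^{pi^k, M^k}(s0)."""
    return sum(
        optimal_value(P, R, m, H)[s0] - policy_value(P, R, pi, H)[s0]
        for m, pi in zip(masks, policies)
    )

# Toy MDP (hypothetical): 2 states, 2 actions, every action stays in place.
H = 2
P = np.zeros((2, 2, 2))
P[0, :, 0] = 1.0
P[1, :, 1] = 1.0
R = np.array([[0.0, 1.0], [0.0, 1.0]])     # action 1 pays 1, action 0 pays 0

full = np.ones((2, 2), dtype=bool)          # context where both actions are allowed
only0 = np.array([[True, False], [True, False]])  # context where only action 0 is allowed
always0 = np.zeros((H, 2), dtype=int)       # policy that always plays action 0

total = cumulative_regret(P, R, [full, only0], [always0, always0], H)
print(total)
```

In this toy example, always playing action 0 costs 2 units of regret in the unrestricted context and nothing in the context where action 0 is the only admissible action, which is why the benchmark is the episode-wise optimum rather than a single fixed optimal value.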
