Tighter Regret Bounds for Contextual Action-Set Reinforcement Learning
Zijun Chen, Zihan Zhang

TL;DR
This paper introduces improved regret bounds for episodic reinforcement learning with context-dependent action sets, extending the MVP algorithm and providing both minimax and gap-dependent guarantees.
Contribution
It extends the MVP algorithm to handle context-dependent action sets and derives new minimax, stochastic, and gap-dependent regret bounds with theoretical guarantees.
Findings
Established a minimax regret bound of O( H^3 K | L) for adversarial contexts.
Derived a regret bound of O( H^3 K) for stochastic contexts.
Provided a sample complexity bound of O( H^3 / ^2) for fixed context distributions.
Abstract
We study episodic reinforcement learning with fixed reward and transition functions, but with episode-dependent admissible action sets that are observed at the start of each episode. Performance is measured by cumulative regret against the episode-wise optimal value, that is, the optimal value under the action context of each episode. We show that the MVP algorithm naturally extends to this framework and enjoys strong theoretical guarantees. In particular, we establish a minimax regret bound for adversarial contexts that depends on the number of possible contexts. This result implies a regret bound for stochastic contexts. We further translate the stochastic regret guarantee into a sample complexity bound for a fixed context…
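The regret notion described in the abstract can be illustrated with a minimal sketch. It is not the MVP algorithm and is not taken from the paper: it only computes, for a toy tabular MDP, the cumulative gap between the optimal value under each episode's admissible action set and the value of the policy actually played. All names (`optimal_value`, `policy_value`, `cumulative_regret`), the array layout, and the toy instance are assumptions made for illustration.

```python
import numpy as np

def optimal_value(P, R, H, allowed, s0=0):
    """Finite-horizon value iteration restricted to the actions in `allowed`.

    P has shape (S, A, S), R has shape (S, A); both are fixed across episodes.
    """
    V = np.zeros(R.shape[0])
    for _ in range(H):
        Q = R + P @ V                    # Q[s, a] = R[s, a] + sum_s' P[s, a, s'] * V[s']
        V = Q[:, allowed].max(axis=1)    # maximize over admissible actions only
    return V[s0]

def policy_value(P, R, H, policy, s0=0):
    """Value of a deterministic policy given as an (H, S) integer array."""
    S = R.shape[0]
    V = np.zeros(S)
    for h in reversed(range(H)):
        Q = R + P @ V
        V = Q[np.arange(S), policy[h]]   # follow the policy's action at step h
    return V[s0]

def cumulative_regret(P, R, H, contexts, policies, s0=0):
    """Sum over episodes of (episode-wise optimal value - value of the played policy)."""
    return float(sum(
        optimal_value(P, R, H, ctx, s0) - policy_value(P, R, H, pi, s0)
        for ctx, pi in zip(contexts, policies)
    ))

# Hypothetical toy instance: 2 states, 2 actions, every action keeps the state.
S, A, H = 2, 2, 2
P = np.zeros((S, A, S))
for s in range(S):
    P[s, :, s] = 1.0
R = np.array([[0.0, 1.0], [0.0, 1.0]])   # action 1 pays 1, action 0 pays 0

contexts = [[0, 1], [0], [0, 1]]          # admissible action set per episode
always_zero = np.zeros((H, S), dtype=int) # plays action 0, admissible in every context
policies = [always_zero] * len(contexts)

regret = cumulative_regret(P, R, H, contexts, policies)
```

In this toy instance the learner loses 2 in each episode where action 1 is admissible and nothing in the episode where only action 0 is allowed, so the benchmark shifts with the context rather than being a single fixed optimal value.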
