Learning Markov Decision Processes under Fully Bandit Feedback

Zhengjia Zhuo; Anupam Gupta; Viswanath Nagarajan

arXiv:2602.02260·cs.LG·February 3, 2026




TL;DR

This paper introduces the first efficient algorithm for learning in episodic MDPs under fully bandit feedback, achieving near-optimal regret bounds and demonstrating competitive empirical performance despite extremely limited feedback.

Contribution

The paper presents a novel bandit learning algorithm for episodic MDPs with fully bandit feedback, giving the first theoretical guarantees for this setting along with an empirical evaluation.

Findings

01

Achieves $\tilde{O}(\sqrt{T})$ regret in the fully bandit feedback setting.

02

Exponential dependence on the horizon length $H$ is necessary for regret bounds.

03

Empirical results show competitive performance with state-of-the-art algorithms.

Abstract

A standard assumption in Reinforcement Learning is that the agent observes every visited state-action pair in the associated Markov Decision Process (MDP), along with the per-step rewards. Strong theoretical results are known in this setting, achieving nearly-tight $\Theta(\sqrt{T})$-regret bounds. However, such detailed feedback can be unrealistic, and recent research has investigated more restricted settings such as trajectory feedback, where the agent observes all the visited state-action pairs, but only a single aggregate reward. In this paper, we consider a far more restrictive "fully bandit" feedback model for episodic MDPs, where the agent does not even observe the visited state-action pairs; it only learns the aggregate reward. We provide the first efficient bandit learning algorithm for episodic MDPs with $\tilde{O}(\sqrt{T})$ regret. Our regret has an…
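The three feedback models contrasted in the abstract can be illustrated with a small simulator. This is only a sketch under stated assumptions, not the paper's algorithm: the toy MDP, the `EpisodeFeedback` container, and the `run_episode` helper are all hypothetical names introduced here, and the code shows only what the agent gets to observe after one episode under each model.

```python
import random
from dataclasses import dataclass


@dataclass
class EpisodeFeedback:
    """What the agent observes after one episode (hypothetical container)."""
    full: list         # standard RL: [(state, action, reward), ...]
    trajectory: tuple  # trajectory feedback: ([(state, action), ...], total reward)
    bandit: float      # fully bandit feedback: total reward only


def run_episode(policy, transitions, rewards, horizon, start_state=0, rng=None):
    """Roll out a deterministic, step-indexed policy for `horizon` steps.

    policy[h][s]         -> action taken at step h in state s
    transitions[s][a]    -> list of next-state probabilities
    rewards[s][a]        -> per-step reward (assumed deterministic here)
    """
    rng = rng or random.Random(0)
    state = start_state
    steps = []
    for h in range(horizon):
        action = policy[h][state]
        reward = rewards[state][action]
        steps.append((state, action, reward))
        probs = transitions[state][action]
        state = rng.choices(range(len(probs)), weights=probs)[0]
    total = sum(r for _, _, r in steps)
    visited = [(s, a) for s, a, _ in steps]
    return EpisodeFeedback(full=steps, trajectory=(visited, total), bandit=total)


# Toy MDP (an assumption for illustration): 2 states, 2 actions,
# action a moves deterministically to state a.
transitions = [[[1.0, 0.0], [0.0, 1.0]],
               [[1.0, 0.0], [0.0, 1.0]]]
rewards = [[0.0, 0.5],
           [0.0, 1.0]]
policy = [[1, 1]] * 3  # always play action 1, horizon H = 3

fb = run_episode(policy, transitions, rewards, horizon=3)
print(fb.full)        # every (state, action, reward) triple
print(fb.trajectory)  # visited pairs plus one aggregate reward
print(fb.bandit)      # the aggregate reward alone
```

Under fully bandit feedback a learner would receive only `fb.bandit` for the policy it chose, so it must infer the MDP's structure from aggregate rewards across episodes rather than from any observed states or actions.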

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Age of Information Optimization