Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback

Tal Lancewicki; Yishay Mansour

arXiv:2502.04004·cs.LG·February 7, 2025

TL;DR

This paper introduces the first policy optimization algorithms for online finite-horizon MDPs with aggregate bandit feedback, achieving near-optimal regret bounds in both known and unknown dynamics scenarios.

Contribution

It presents novel policy optimization algorithms for this challenging setting and establishes the first optimal regret bounds in the known-dynamics case.

Findings

01

Achieved the first optimal regret bound of $\tilde{\Theta}(H^{2} \sqrt{S A K})$ in the known-dynamics case.

02

Established a regret bound of $\tilde{O}(H^{3} S \sqrt{A K})$ in the unknown-dynamics case, improving the best known result by a factor of $H^{2} S^{5} A^{2}$.

03

Demonstrated the effectiveness of policy optimization in settings with aggregate bandit feedback.

Abstract

We study online finite-horizon Markov Decision Processes with adversarially changing loss and aggregate bandit feedback (a.k.a. full-bandit). Under this type of feedback, the agent observes only the total loss incurred over the entire trajectory, rather than the individual losses at each intermediate step within the trajectory. We introduce the first Policy Optimization algorithms for this setting. In the known-dynamics case, we achieve the first optimal regret bound of $\tilde{\Theta}(H^{2} \sqrt{S A K})$, where $K$ is the number of episodes, $H$ is the episode horizon, $S$ is the number of states, and $A$ is the number of actions. In the unknown-dynamics case, we establish a regret bound of $\tilde{O}(H^{3} S \sqrt{A K})$, significantly improving the best known result by a factor of $H^{2} S^{5} A^{2}$.
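The feedback model described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it only simulates one episode of a toy tabular MDP and shows what the learner gets to see. The function `run_episode`, the toy sizes, the uniform dynamics and policy, and the assumption that per-step losses lie in [0, 1] are all hypothetical choices made for illustration.

```python
import random

def run_episode(P, loss, policy, H, rng, s0=0):
    """Roll out one episode and return the trajectory plus ONLY the summed loss.

    P[h][s][a]   : list of next-state probabilities (hypothetical toy dynamics)
    loss[h][s][a]: per-step loss, assumed to lie in [0, 1]
    policy[h][s] : list of action probabilities
    """
    s, total, trajectory = s0, 0.0, []
    for h in range(H):
        a = rng.choices(range(len(policy[h][s])), weights=policy[h][s])[0]
        trajectory.append((h, s, a))
        total += loss[h][s][a]  # accumulated internally, never revealed per step
        s = rng.choices(range(len(P[h][s][a])), weights=P[h][s][a])[0]
    # Aggregate bandit feedback: the learner sees one scalar per episode.
    return trajectory, total

# Toy instance (all values are illustrative assumptions).
H, S, A = 3, 2, 2
rng = random.Random(0)
P = [[[[1.0 / S] * S for _ in range(A)] for _ in range(S)] for _ in range(H)]
loss = [[[rng.random() for _ in range(A)] for _ in range(S)] for _ in range(H)]
policy = [[[1.0 / A] * A for _ in range(S)] for _ in range(H)]

trajectory, aggregate_loss = run_episode(P, loss, policy, H, rng)
print("visited (h, s, a):", trajectory)
print("observed aggregate loss:", aggregate_loss)
```

Under standard (semi-)bandit feedback the learner would instead receive the `H` individual values `loss[h][s][a]` along the trajectory; here it must attribute a single scalar to all `H` state-action pairs it visited.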

Taxonomy

Topics: Advanced Bandit Algorithms Research · Smart Grid Energy Management · Optimization and Search Problems