Reward-Mixing MDPs with a Few Latent Contexts are Learnable

Jeongyeol Kwon; Yonathan Efroni; Constantine Caramanis; Shie Mannor

arXiv:2210.02594·cs.LG·October 7, 2022

TL;DR

This paper introduces an efficient algorithm for learning near-optimal policies in reward-mixing MDPs with multiple latent reward models, resolving open questions for arbitrary M and establishing fundamental complexity bounds.

Contribution

It extends previous work to arbitrary M, providing a sample-efficient algorithm with theoretical guarantees and new techniques for higher-order moments in RMMDPs.

Findings

01

The $EM^{2}$ algorithm learns an $\epsilon$-optimal policy using $\tilde{O} (\epsilon^{- 2} \cdot S^{d} A^{d} \cdot poly (H, Z)^{d})$ episodes.

02

A lower bound shows that a sample complexity super-polynomial in $M$ is necessary.

03

The method generalizes the method-of-moments approach to complex reward-mixing scenarios.

Abstract

We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs): at the beginning of every episode, nature randomly picks a latent reward model among $M$ candidates, and an agent interacts with the MDP throughout the episode for $H$ time steps. Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model. Previous work established an upper bound for RMMDPs for $M = 2$. In this work, we resolve several open questions that remained for the RMMDP model. For an arbitrary $M \geq 2$, we provide a sample-efficient algorithm, $EM^{2}$, that outputs an $\epsilon$-optimal policy using $\tilde{O} (\epsilon^{- 2} \cdot S^{d} A^{d} \cdot poly (H, Z)^{d})$ episodes, where $S, A$ are the number of states and actions respectively, $H$ is the time-horizon, $Z$ is the support size of reward distributions and…
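The interaction protocol described in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the paper's $EM^{2}$ algorithm: it only samples one episode from a tabular RMMDP. It assumes that transitions are shared across latent contexts and only rewards differ, and that rewards are Bernoulli (support size $Z = 2$). The names `RewardMixingMDP`, `run_episode`, and `example_mdp` are hypothetical and chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class RewardMixingMDP:
    # transitions[s][a] is a list of next-state probabilities
    # (assumed to be shared by all latent contexts).
    transitions: list
    # reward_means[m][s][a] is the Bernoulli reward mean under context m
    # (Bernoulli rewards are an assumption of this sketch).
    reward_means: list
    # mixing_weights[m] is the probability that nature picks context m.
    mixing_weights: list
    initial_state: int = 0


def run_episode(mdp, policy, horizon, rng):
    """Sample one episode of length `horizon`; return (trajectory, context)."""
    n_states = len(mdp.transitions)
    n_contexts = len(mdp.mixing_weights)
    # Nature draws the latent reward model once, at the start of the episode.
    # The agent never observes it; it is returned here only for inspection.
    context = rng.choices(range(n_contexts), weights=mdp.mixing_weights)[0]
    state = mdp.initial_state
    trajectory = []
    for t in range(horizon):
        action = policy(state, t)
        mean = mdp.reward_means[context][state][action]
        reward = 1 if rng.random() < mean else 0
        trajectory.append((state, action, reward))
        state = rng.choices(range(n_states), weights=mdp.transitions[state][action])[0]
    return trajectory, context


# Toy instance with S = 2, A = 2, M = 2: context 0 always pays 1, context 1 never pays.
example_mdp = RewardMixingMDP(
    transitions=[[[0.5, 0.5], [1.0, 0.0]], [[0.0, 1.0], [0.5, 0.5]]],
    reward_means=[
        [[1.0, 1.0], [1.0, 1.0]],
        [[0.0, 0.0], [0.0, 0.0]],
    ],
    mixing_weights=[0.5, 0.5],
)

trajectory, context = run_episode(
    example_mdp, policy=lambda s, t: 0, horizon=5, rng=random.Random(0)
)
total_reward = sum(r for _, _, r in trajectory)
```

Because the context is hidden and redrawn every episode, the agent sees only the `(state, action, reward)` triples, and rewards observed within one episode are correlated through the shared latent context.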

Taxonomy

Topics: Reinforcement Learning in Robotics