Markov Decision Processes with Continuous Side Information
Aditya Modi, Nan Jiang, Satinder Singh, Ambuj Tewari

TL;DR
This paper studies reinforcement learning over a sequence of episodic Markov Decision Processes in which each episode's dynamics are determined by a context observed at the start of the episode. It proposes algorithms under a smoothness assumption and derives PAC bounds for them.
Contribution
It introduces algorithms for contextual MDPs whose parameters vary smoothly with the context, gives lower and upper PAC bounds under that assumption, and presents a KWIK-based PAC algorithm for a tractable linear setting.
Findings
PAC bounds under smoothness assumptions
Lower bound showing exponential dependence on dimension
A linear setting with a KWIK-based PAC algorithm
Abstract
We consider a reinforcement learning (RL) setting in which the agent interacts with a sequence of episodic MDPs. At the start of each episode the agent has access to some side-information or context that determines the dynamics of the MDP for that episode. Our setting is motivated by applications in healthcare where baseline measurements of a patient at the start of a treatment episode form the context that may provide information about how the patient might respond to treatment decisions. We propose algorithms for learning in such Contextual Markov Decision Processes (CMDPs) under an assumption that the unobserved MDP parameters vary smoothly with the observed context. We also give lower and upper PAC bounds under the smoothness assumption. Because our lower bound has an exponential dependence on the dimension, we consider a tractable linear setting where the context is used to create…
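As a rough illustration of the setting described above, the following minimal sketch builds a toy contextual MDP in which the transition distribution is a smooth function of the context, so nearby contexts induce nearby dynamics. Everything in it is an assumption made for illustration and is not taken from the paper: the names `transition_probs`, `l1_distance`, and `run_episode`, the softmax parameterization, the sine-based weights, and the state and action counts.

```python
import math
import random

N_STATES, N_ACTIONS = 3, 2  # hypothetical sizes, chosen only for the example


def transition_probs(context, state, action):
    """Hypothetical smooth map from a context vector to P(. | state, action)."""
    logits = []
    for next_state in range(N_STATES):
        # Fixed, arbitrary weights: each logit is linear in the context, so a
        # small change in the context changes the softmax output only slightly.
        weights = [
            math.sin(1 + state + 2 * action + 3 * next_state + i)
            for i in range(len(context))
        ]
        logits.append(sum(w * x for w, x in zip(weights, context)))
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def l1_distance(p, q):
    """L1 distance between two distributions over next states."""
    return sum(abs(a - b) for a, b in zip(p, q))


def run_episode(context, policy, horizon=5, seed=0):
    """Roll out one episode; the context is fixed for the whole episode."""
    rng = random.Random(seed)
    state, trajectory = 0, []
    for _ in range(horizon):
        action = policy(context, state)
        probs = transition_probs(context, state, action)
        next_state = rng.choices(range(N_STATES), weights=probs)[0]
        trajectory.append((state, action, next_state))
        state = next_state
    return trajectory


if __name__ == "__main__":
    x, x_near = [0.2, 0.5], [0.21, 0.5]
    p = transition_probs(x, 0, 1)
    q = transition_probs(x_near, 0, 1)
    print("L1 gap between nearby contexts:", l1_distance(p, q))
    print(run_episode(x, policy=lambda ctx, s: 0))
```

In this toy model, the smoothness assumption corresponds to the L1 gap between `p` and `q` shrinking as the two contexts move closer together, which is what would let a learner reuse data gathered under similar contexts.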
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms
