Reinforcement Learning for Individual Optimal Policy from Heterogeneous Data
Rui Miao, Babak Shahbaba, Annie Qu

TL;DR
This paper introduces a new offline reinforcement learning framework that personalizes policies for heterogeneous populations using individual latent variables, improving policy optimization in diverse environments.
Contribution
It proposes an individualized offline policy optimization method that combines a novel heterogeneous model with the Penalized Pessimistic Personalized Policy Learning (P4L) algorithm to address heterogeneity in offline RL.
Findings
P4L guarantees a fast rate on the average regret.
The method outperforms existing approaches in simulations.
A real-data application confirms the method's superior performance.
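For readers unfamiliar with the term, one common way to write an average regret over N individuals is sketched below. The notation is an assumption made for illustration (V_i for individual i's value function, \pi_i^* for that individual's optimal policy, \hat{\pi}_i for the learned policy) and may differ from the paper's exact definition.

```latex
\[
  \overline{\mathrm{Regret}}
  \;=\;
  \frac{1}{N} \sum_{i=1}^{N}
  \Bigl[ V_i\bigl(\pi_i^{*}\bigr) - V_i\bigl(\hat{\pi}_i\bigr) \Bigr]
\]
```

Under this reading, a "fast rate" means the quantity shrinks quickly as the amount of pre-collected data grows.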
Abstract
Offline reinforcement learning (RL) aims to find optimal policies in dynamic environments that maximize the expected total reward by leveraging pre-collected data. Learning from heterogeneous data is one of the fundamental challenges in offline RL. Traditional methods focus on learning a single optimal policy for all individuals from pre-collected data consisting of a single episode or homogeneous batch episodes, and thus may yield a suboptimal policy for a heterogeneous population. In this paper, we propose an individualized offline policy optimization framework for heterogeneous time-stationary Markov decision processes (MDPs). The proposed heterogeneous model with individual latent variables enables us to efficiently estimate the individual Q-functions, and our Penalized Pessimistic Personalized Policy Learning (P4L) algorithm guarantees a fast rate on the average regret under a weak…
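The two ideas named in the abstract, individual Q-functions driven by latent variables and pessimistic policy selection, can be illustrated with a toy sketch. This is not the P4L algorithm: the linear form of the Q-function, the uncertainty widths, the `beta` weight, and all function names are assumptions chosen only to show how a per-individual, pessimistic choice can differ from a greedy one.

```python
def individual_q(features, latent):
    """Toy Q_i(s, a): inner product of shared features phi(s, a) with one
    individual's latent vector (an assumed linear form, not the paper's model)."""
    return [[sum(f * u for f, u in zip(phi_sa, latent)) for phi_sa in phi_s]
            for phi_s in features]


def pessimistic_policy(q_hat, width, beta=1.0):
    """For each state, pick the action with the largest lower bound
    q_hat - beta * width. `beta` is a hypothetical pessimism weight."""
    policy = []
    for q_s, w_s in zip(q_hat, width):
        lower = [q - beta * w for q, w in zip(q_s, w_s)]
        policy.append(max(range(len(lower)), key=lower.__getitem__))
    return policy


# One state, two actions, two-dimensional shared features (made-up numbers).
features = [[[1.0, 0.0], [0.0, 1.0]]]

# Each individual has its own latent vector (made-up numbers).
latents = {"person_a": [1.0, 1.2], "person_b": [0.1, 0.5]}

# Assumed uncertainty widths: action 1 is poorly covered in person_a's data.
widths = {"person_a": [[0.1, 0.8]], "person_b": [[0.1, 0.1]]}

policies = {
    name: pessimistic_policy(individual_q(features, u), widths[name])
    for name, u in latents.items()
}
greedy = {
    name: pessimistic_policy(individual_q(features, u), widths[name], beta=0.0)
    for name, u in latents.items()
}
```

In this toy setup the greedy rule picks action 1 for both individuals, while the pessimistic rule avoids the poorly covered action for `person_a` only, so the resulting policies differ across individuals.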
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Gaussian Processes and Bayesian Inference
