Reinforcement Learning for Individual Optimal Policy from Heterogeneous Data

Rui Miao; Babak Shahbaba; Annie Qu

arXiv:2505.09496·stat.ML·March 10, 2026

TL;DR

This paper introduces a new offline reinforcement learning framework that personalizes policies for heterogeneous populations using individual latent variables, improving policy optimization in diverse environments.

Contribution

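The paper proposes an individualized offline policy optimization framework for heterogeneous time-stationary MDPs. It combines a heterogeneous model with individual latent variables, used to estimate individual Q-functions, with the Penalized Pessimistic Personalized Policy Learning (P4L) algorithm.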

Findings

01. P4L guarantees a fast rate on the average regret.

02. The method outperforms existing approaches in simulations.

03. An application to real data confirms the method's superior performance.
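
For readers unfamiliar with the term, the "average regret" in the first finding can be read as the value lost by the learned policies, averaged over individuals. The block below is a generic formalization and not the paper's own notation: the symbols N (number of individuals), V_i (the value function of individual i), \pi_i^* (that individual's optimal policy), and \hat{\pi}_i (the learned individual policy) are all assumptions introduced here for illustration.

```latex
% Generic definition of average regret over N individuals (assumed notation).
\[
  \mathrm{AvgRegret}
  \;=\;
  \frac{1}{N} \sum_{i=1}^{N}
  \Bigl[ V_i(\pi_i^{*}) - V_i(\hat{\pi}_i) \Bigr],
  \qquad
  V_i(\pi_i^{*}) \ge V_i(\hat{\pi}_i) \ \text{for every } i .
\]
```

Under this reading, a "fast rate" means the quantity above shrinks quickly as the amount of pre-collected data grows; the exact rate and its conditions are stated in the paper itself.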

Abstract

Offline reinforcement learning (RL) aims to find optimal policies in dynamic environments in order to maximize the expected total rewards by leveraging pre-collected data. Learning from heterogeneous data is one of the fundamental challenges in offline RL. Traditional methods focus on learning an optimal policy for all individuals with pre-collected data from a single episode or homogeneous batch episodes, and thus, may result in a suboptimal policy for a heterogeneous population. In this paper, we propose an individualized offline policy optimization framework for heterogeneous time-stationary Markov decision processes (MDPs). The proposed heterogeneous model with individual latent variables enables us to efficiently estimate the individual Q-functions, and our Penalized Pessimistic Personalized Policy Learning (P4L) algorithm guarantees a fast rate on the average regret under a weak…
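
The abstract names two ingredients, individual Q-functions and pessimism, without detailing either. As a rough illustration only, the sketch below shows the generic pessimism principle applied per individual: choose the action that maximizes a lower bound on the estimated Q-value rather than the estimate itself. It is not the P4L algorithm. The function name `pessimistic_actions`, the uncertainty array `q_se`, the multiplier `beta`, and the toy numbers are all hypothetical.

```python
import numpy as np


def pessimistic_actions(q_hat, q_se, beta=1.0):
    """Pick, per individual and state, the action with the best lower bound.

    q_hat, q_se: arrays of shape (n_individuals, n_states, n_actions) holding
    hypothetical individual Q estimates and their uncertainty (assumed inputs,
    not quantities defined by the paper).
    """
    lower_bound = q_hat - beta * q_se  # penalize poorly supported actions
    return lower_bound.argmax(axis=-1)


# Toy data: 2 individuals, 1 state, 2 actions (made-up numbers).
q_hat = np.array([[[1.0, 1.2]], [[0.5, 0.4]]])
q_se = np.array([[[0.1, 0.5]], [[0.1, 0.1]]])

policy = pessimistic_actions(q_hat, q_se, beta=1.0)
greedy = q_hat.argmax(axis=-1)  # ignores uncertainty, for comparison
```

In this toy case the first individual's greedy choice is action 1, but its estimate is much less certain, so the pessimistic rule falls back to action 0. Because each individual has its own slice of `q_hat`, the resulting policy can differ across individuals, which is the point of personalization.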

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Gaussian Processes and Bayesian Inference
