Kernel Metric Learning for In-Sample Off-Policy Evaluation of Deterministic RL Policies

Haanvid Lee; Tri Wahyu Guntara; Jongmin Lee; Yung-Kyun Noh; Kee-Eung Kim

arXiv:2405.18792 · cs.LG · May 30, 2024

TL;DR

This paper introduces a kernel-based method for off-policy evaluation of deterministic policies in continuous action spaces, reducing variance and improving accuracy over traditional importance sampling methods.

Contribution

It proposes a novel kernel relaxation approach with learned metrics to enhance in-sample off-policy evaluation for deterministic policies in continuous environments.

Findings

01. Kernel-based approach reduces estimation error.

02. Optimized kernel metrics improve evaluation accuracy.

03. Method outperforms existing baselines in empirical tests.

Abstract

We consider off-policy evaluation (OPE) of deterministic target policies for reinforcement learning (RL) in environments with continuous action spaces. While it is common to use importance sampling for OPE, it suffers from high variance when the behavior policy deviates significantly from the target policy. In order to address this issue, some recent works on OPE proposed in-sample learning with importance resampling. Yet, these approaches are not applicable to deterministic target policies for continuous action spaces. To address this limitation, we propose to relax the deterministic target policy using a kernel and learn the kernel metrics that minimize the overall mean squared error of the estimated temporal difference update vector of an action value function, where the action value function is used for policy evaluation. We derive the bias and variance of the estimation error due…
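The kernel relaxation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a Gaussian kernel, a fixed diagonal metric (the paper learns the metric rather than fixing it), and a known behavior-policy density. The function names `gaussian_kernel_density`, `relaxed_importance_weights`, and `resample_indices` are hypothetical. The idea is that a deterministic target action is replaced by a kernel density centred on it, so that importance ratios against the behavior policy become well defined and can drive importance resampling of logged transitions.

```python
import math
import random


def gaussian_kernel_density(action, target_action, metric_diag, bandwidth):
    """Gaussian kernel density centred on the target action.

    Assumption: the metric A is diagonal with positive entries `metric_diag`,
    so the squared distance is sum_i A_ii * (a_i - pi(s)_i)^2.
    """
    d = len(action)
    sq_dist = sum(
        m * (a - t) ** 2 for a, t, m in zip(action, target_action, metric_diag)
    )
    # Normalizer of N(target_action, bandwidth^2 * A^{-1}).
    norm = math.sqrt(math.prod(metric_diag)) / (
        (2.0 * math.pi) ** (d / 2) * bandwidth ** d
    )
    return norm * math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def relaxed_importance_weights(
    actions, target_actions, behavior_densities, metric_diag, bandwidth
):
    """Ratio of the relaxed target density to the behavior density per sample."""
    return [
        gaussian_kernel_density(a, t, metric_diag, bandwidth) / mu
        for a, t, mu in zip(actions, target_actions, behavior_densities)
    ]


def resample_indices(weights, num_samples, rng):
    """Importance resampling: draw dataset indices proportionally to the weights."""
    return rng.choices(range(len(weights)), weights=weights, k=num_samples)


if __name__ == "__main__":
    # Toy 1-D data: logged actions, the target policy's actions at the same
    # states, and the behavior policy's density at each logged action.
    logged_actions = [[0.0], [0.5], [3.0]]
    target_actions = [[0.0], [0.0], [0.0]]
    behavior_densities = [0.4, 0.4, 0.4]

    weights = relaxed_importance_weights(
        logged_actions, target_actions, behavior_densities,
        metric_diag=[1.0], bandwidth=1.0,
    )
    batch = resample_indices(weights, num_samples=5, rng=random.Random(0))
    print(weights, batch)
```

In this sketch, logged actions close to the target action receive larger weights and are therefore drawn more often, so downstream temporal difference updates would use only in-sample actions. The bandwidth and metric trade bias against variance, which is the quantity the abstract says the learned metric is chosen to minimize.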

Code & Models

Repositories

haanvid/kmifqe (PyTorch, official)

Videos

Kernel Metric Learning for In-Sample Off-Policy Evaluation of Deterministic RL Policies · SlidesLive

Taxonomy

Topics: Software Reliability and Analysis Research