State-Action Similarity-Based Representations for Off-Policy Evaluation
Brahma S. Pavse, Josiah P. Hanna

TL;DR
This paper introduces a representation learning method, built on a state-action similarity metric tailored to off-policy evaluation (OPE), that improves the data-efficiency and accuracy of OPE in reinforcement learning, in particular of the fitted Q-evaluation (FQE) algorithm.
Contribution
The paper proposes an OPE-specific state-action similarity metric and a learned encoder that improves FQE's data-efficiency and robustness against distribution shifts.
Findings
The proposed method outperforms other representation learning approaches in OPE tasks.
It makes FQE less prone to diverging under distribution shift.
The learned representations improve OPE accuracy and data-efficiency.
Abstract
In reinforcement learning, off-policy evaluation (OPE) is the problem of estimating the expected return of an evaluation policy given a fixed dataset that was collected by running one or more different policies. One of the more empirically successful algorithms for OPE has been the fitted Q-evaluation (FQE) algorithm, which uses temporal-difference updates to learn an action-value function that is then used to estimate the expected return of the evaluation policy. Typically, the original fixed dataset is fed directly into FQE to learn the action-value function of the evaluation policy. Instead, in this paper, we seek to enhance the data-efficiency of FQE by first transforming the fixed dataset using a learned encoder, and then feeding the transformed dataset into FQE. To learn such an encoder, we introduce an OPE-tailored state-action behavioral similarity metric, and use this metric…
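The pipeline the abstract describes (encode the fixed dataset, then run FQE on the encoded data) can be illustrated with a minimal tabular sketch. This is not the authors' implementation: the function names, the toy two-state dataset, and the `encode` argument are all hypothetical, and the identity encoder used by default is only a placeholder for the learned encoder, whose training procedure is not shown here. In the tabular case, the regression step of FQE reduces to averaging the temporal-difference targets per encoded state-action pair.

```python
from collections import defaultdict


def fitted_q_evaluation(dataset, pi_e, n_actions, gamma=0.9, n_iters=50,
                        encode=lambda s, a: (s, a)):
    """Tabular FQE sketch (illustrative names, not from the paper).

    dataset: list of (s, a, r, s_next, done) transitions from any behavior policy.
    pi_e:    pi_e[s][a] is the evaluation policy's probability of a in s.
    encode:  hypothetical stand-in for a learned state-action encoder;
             the default identity mapping recovers plain FQE.
    """
    q = defaultdict(float)  # Q-values keyed by the encoded state-action pair
    for _ in range(n_iters):
        sums, counts = defaultdict(float), defaultdict(int)
        for s, a, r, s_next, done in dataset:
            # TD target: reward plus discounted expected Q under pi_e at s_next.
            bootstrap = 0.0 if done else sum(
                pi_e[s_next][b] * q[encode(s_next, b)] for b in range(n_actions)
            )
            z = encode(s, a)
            sums[z] += r + gamma * bootstrap
            counts[z] += 1
        # Least-squares fit over a tabular function class: mean target per key.
        q = defaultdict(float, {z: sums[z] / counts[z] for z in sums})
    return q


def estimate_return(q, pi_e, initial_states, n_actions,
                    encode=lambda s, a: (s, a)):
    """Average over initial states of the expected Q-value under pi_e."""
    values = [
        sum(pi_e[s0][a] * q[encode(s0, a)] for a in range(n_actions))
        for s0 in initial_states
    ]
    return sum(values) / len(values)


# Toy dataset (made up for illustration): two states, two actions.
# In state 0, action 0 moves to state 1 with reward 0; action 1 ends with reward 1.
# In state 1, action 0 ends with reward 2; action 1 ends with reward 0.
dataset = [
    (0, 0, 0.0, 1, False),
    (0, 1, 1.0, None, True),
    (1, 0, 2.0, None, True),
    (1, 1, 0.0, None, True),
]
pi_e = {0: [1.0, 0.0], 1: [1.0, 0.0]}  # evaluation policy always takes action 0

q = fitted_q_evaluation(dataset, pi_e, n_actions=2, gamma=0.9)
estimate = estimate_return(q, pi_e, initial_states=[0], n_actions=2)
print(estimate)  # approximately 1.8, i.e. 0 + 0.9 * 2
```

An encoder that maps behaviorally similar state-action pairs to the same key would pool their targets in the averaging step, which is one way to see how a good representation could make FQE more data-efficient.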
Taxonomy
Topics: Reinforcement Learning in Robotics
