DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections

Ofir Nachum; Yinlam Chow; Bo Dai; Lihong Li

arXiv:1906.04733 · cs.LG · November 6, 2019 · 42 cites

TL;DR

DualDICE is an algorithm for estimating discounted stationary distribution ratios in offline reinforcement learning. It requires no knowledge of the behavior policy that generated the data, and it is reported to improve accuracy and stability over prior methods.

Contribution

The paper introduces a behavior-agnostic way to estimate distribution ratios that avoids any direct use of importance weights, which improves stability and broadens applicability in offline RL.

Findings

01 Significantly improves accuracy in off-policy evaluation

02 Avoids importance weights, reducing optimization instability

03 Provides theoretical guarantees for estimation accuracy

Abstract

In many real-world reinforcement learning applications, access to the environment is limited to a fixed dataset, instead of direct (online) interaction with the environment. When using this data for either evaluation or training of a new policy, accurate estimates of discounted stationary distribution ratios -- correction terms which quantify the likelihood that the new policy will experience a certain state-action pair normalized by the probability with which the state-action pair appears in the dataset -- can improve accuracy and performance. In this work, we propose an algorithm, DualDICE, for estimating these quantities. In contrast to previous approaches, our algorithm is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset. Furthermore, it eschews any direct use of importance weights, thus avoiding potential optimization instabilities endemic of…
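To make the "correction term" in the abstract concrete, here is a minimal sketch of what a discounted stationary distribution ratio is and how it turns a dataset average into a policy value. It is not the DualDICE algorithm: the two-state MDP, the dataset distribution `D_DATA`, and every name below are hypothetical, and the ratio is computed from a known model, which is exactly what DualDICE is designed to avoid. Writing d^pi(s, a) = (1 - gamma) * sum_t gamma^t Pr(s_t = s, a_t = a) for the target policy and d^D for the dataset distribution, the ratio is w(s, a) = d^pi(s, a) / d^D(s, a), and the normalized policy value equals the dataset expectation of w(s, a) * r(s, a).

```python
# Illustrative sketch only: this tiny MDP and all names are invented here.
# The ratio is computed from a known model, which DualDICE does not require.
GAMMA = 0.9
N_STATES, N_ACTIONS = 2, 2
INIT = [1.0, 0.0]                       # initial state distribution
P = [[[0.8, 0.2], [0.1, 0.9]],          # P[s][a][s2]: transition probabilities
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[0.0, 1.0], [2.0, 0.5]]            # R[s][a]: reward
PI = [[0.3, 0.7], [0.6, 0.4]]           # PI[s][a]: target policy pi(a | s)
D_DATA = [[0.4, 0.1], [0.2, 0.3]]       # d^D(s, a): dataset distribution (assumed)


def discounted_occupancy(steps=2000):
    """d^pi(s, a) = (1 - gamma) * sum_t gamma^t Pr(s_t = s, a_t = a), truncated."""
    d = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    state_dist, weight = list(INIT), 1.0 - GAMMA
    for _ in range(steps):
        nxt = [0.0] * N_STATES
        for s in range(N_STATES):
            for a in range(N_ACTIONS):
                p_sa = state_dist[s] * PI[s][a]
                d[s][a] += weight * p_sa
                for s2 in range(N_STATES):
                    nxt[s2] += p_sa * P[s][a][s2]
        state_dist, weight = nxt, weight * GAMMA
    return d


def normalized_value_bellman(steps=2000):
    """(1 - gamma) * E[V(s_0)], with V obtained by iterating the Bellman equation."""
    v = [0.0] * N_STATES
    for _ in range(steps):
        v = [sum(PI[s][a] * (R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2]
                                                   for s2 in range(N_STATES)))
                 for a in range(N_ACTIONS))
             for s in range(N_STATES)]
    return (1.0 - GAMMA) * sum(INIT[s] * v[s] for s in range(N_STATES))


d_pi = discounted_occupancy()
# Correction ratio w(s, a) = d^pi(s, a) / d^D(s, a); needs d^D > 0 everywhere.
ratio = [[d_pi[s][a] / D_DATA[s][a] for a in range(N_ACTIONS)]
         for s in range(N_STATES)]
# Policy value as a ratio-weighted average of rewards under the dataset distribution.
rho_from_ratio = sum(D_DATA[s][a] * ratio[s][a] * R[s][a]
                     for s in range(N_STATES) for a in range(N_ACTIONS))
rho_from_bellman = normalized_value_bellman()
print(rho_from_ratio, rho_from_bellman)
```

The two printed numbers should agree up to floating-point error, since both compute the same normalized value by different routes. In the offline setting described in the abstract, neither the transition model nor the behavior policy is available, so the ratio has to be estimated from samples instead.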

Taxonomy

Topics: Reinforcement Learning in Robotics · Smart Grid Energy Management · Advanced Bandit Algorithms Research