Design from Policies: Conservative Test-Time Adaptation for Offline Policy Optimization

Jinxin Liu; Hongyin Zhang; Zifeng Zhuang; Yachen Kang; Donglin Wang; Bin Wang

arXiv:2306.14479 · cs.LG · October 31, 2023

TL;DR

This paper introduces DROP, a non-iterative offline RL paradigm that separates value estimation and policy extraction, enabling safe, adaptive policy deployment during testing with improved performance.

Contribution

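DROP proposes a non-iterative bi-level offline RL framework that performs value estimation during training and policy extraction at test time. It addresses what information to transfer from the inner level to the outer level and how to exploit that information safely, drawing on model-based optimization and adaptive inference.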

Findings

01. DROP achieves comparable or better performance than prior methods.

02. DROP enables safe and adaptive policy deployment during testing.

03. The method effectively decomposes data and learns a conservative score model.

Abstract

In this work, we decouple the iterative bi-level offline RL (value estimation and policy extraction) from the offline training phase, forming a non-iterative bi-level paradigm and avoiding the iterative error propagation over two levels. Specifically, this non-iterative paradigm allows us to conduct inner-level optimization (value estimation) in training, while performing outer-level optimization (policy extraction) in testing. Naturally, such a paradigm raises three core questions that are not fully answered by prior non-iterative offline RL counterparts like reward-conditioned policy: (q1) What information should we transfer from the inner-level to the outer-level? (q2) What should we pay attention to when exploiting the transferred information for safe/confident outer-level optimization? (q3) What are the benefits of concurrently conducting outer-level optimization during testing?…
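The split described in the abstract (inner-level optimization in training, outer-level optimization in testing) can be illustrated with a toy sketch. This is not the paper's implementation: the names `score`, `score_grad`, `adapt_embedding`, `w`, `z_data`, and `lam` are hypothetical, the "trained" score model is a fixed linear term rather than a learned network, and a quadratic penalty toward a data-supported embedding stands in for conservatism. The only point is that, with the score model frozen after training, test-time adaptation reduces to gradient ascent on a low-dimensional variable.

```python
import numpy as np

def score(z, w, z_data, lam):
    """Toy conservative score (assumed form, not from the paper):
    a frozen linear value term minus a penalty for leaving the data embedding."""
    return float(w @ z - lam * np.sum((z - z_data) ** 2))

def score_grad(z, w, z_data, lam):
    """Analytic gradient of the toy score with respect to z."""
    return w - 2.0 * lam * (z - z_data)

def adapt_embedding(z_init, w, z_data, lam, lr=0.1, steps=200):
    """Outer-level step at test time: gradient ascent on z with the score frozen."""
    z = np.array(z_init, dtype=float)
    for _ in range(steps):
        z = z + lr * score_grad(z, w, z_data, lam)
    return z

if __name__ == "__main__":
    w = np.array([1.0, -2.0])        # stands in for a value model fit during training
    z_data = np.array([0.5, 0.5])    # embedding supported by the offline data
    z_star = adapt_embedding(z_data, w, z_data, lam=1.0)
    print("adapted embedding:", z_star)
    print("score before:", score(z_data, w, z_data, 1.0))
    print("score after: ", score(z_star, w, z_data, 1.0))
```

In this toy setting a larger `lam` keeps the adapted embedding closer to `z_data`, which is one simple way to picture the trade-off between exploiting the transferred information and staying within the data's support.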

Videos

Design from Policies: Conservative Test-Time Adaptation for Offline Policy Optimization · SlidesLive

Taxonomy

Topics: Age of Information Optimization · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning