Design from Policies: Conservative Test-Time Adaptation for Offline Policy Optimization
Jinxin Liu, Hongyin Zhang, Zifeng Zhuang, Yachen Kang, Donglin Wang, Bin Wang

TL;DR
This paper introduces DROP, a non-iterative offline RL paradigm that separates value estimation and policy extraction, enabling safe, adaptive policy deployment during testing with improved performance.
Contribution
DROP proposes a novel non-iterative bi-level offline RL framework that answers key questions about information transfer and safe exploitation, with a focus on model-based optimization and adaptive inference.
Findings
DROP achieves performance comparable to or better than that of prior methods.
DROP enables safe and adaptive policy deployment during testing.
The method effectively decomposes data and learns a conservative score model.
Abstract
In this work, we decouple the iterative bi-level offline RL (value estimation and policy extraction) from the offline training phase, forming a non-iterative bi-level paradigm and avoiding the iterative error propagation over two levels. Specifically, this non-iterative paradigm allows us to conduct inner-level optimization (value estimation) in training, while performing outer-level optimization (policy extraction) in testing. Naturally, such a paradigm raises three core questions that are not fully answered by prior non-iterative offline RL counterparts like reward-conditioned policy: (q1) What information should we transfer from the inner-level to the outer-level? (q2) What should we pay attention to when exploiting the transferred information for safe/confident outer-level optimization? (q3) What are the benefits of concurrently conducting outer-level optimization during testing?…
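The split described in the abstract (inner-level value estimation during training, outer-level optimization during testing) can be illustrated with a toy sketch. This is not the DROP algorithm: the one-dimensional embedding `z`, the quadratic score model, the clipping rule used as a stand-in for conservatism, and the function names `fit_score_model` and `adapt_at_test_time` are all assumptions made for illustration.

```python
import numpy as np

def fit_score_model(z, returns, ridge=1e-3):
    """Inner level (training): fit f(z) = w0 + w1*z + w2*z^2 by ridge regression.

    Hypothetical stand-in for a learned score model.
    """
    X = np.stack([np.ones_like(z), z, z ** 2], axis=1)
    return np.linalg.solve(X.T @ X + ridge * np.eye(3), X.T @ returns)

def adapt_at_test_time(w, z_init, z_low, z_high, steps=200, lr=0.05):
    """Outer level (testing): projected gradient ascent on the fitted score.

    Clipping z to [z_low, z_high] keeps the search inside the range covered
    by the training data, a crude proxy for conservative optimization.
    """
    z = float(z_init)
    for _ in range(steps):
        grad = w[1] + 2.0 * w[2] * z  # d f / d z for the quadratic model
        z = float(np.clip(z + lr * grad, z_low, z_high))
    return z

# Synthetic offline data: the score peaks at z = 0.5 (assumed for the demo).
z_data = np.linspace(-1.0, 1.0, 21)
returns = -(z_data - 0.5) ** 2

w = fit_score_model(z_data, returns)           # done once, offline
z_star = adapt_at_test_time(                   # done at deployment time
    w, z_init=-1.0, z_low=z_data.min(), z_high=z_data.max()
)
```

In this sketch the score model is fitted once and never updated, so the two levels do not alternate; all adaptation happens in the test-time loop. When the fitted score keeps increasing beyond the data range, the clipping step stops the search at the boundary instead of extrapolating.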
Taxonomy
Topics: Age of Information Optimization · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
