Offline Policy Optimization in RL with Variance Regularization
Riashat Islam, Samarth Sinha, Homanga Bharadhwaj, Samin Yeasar Arnob, Zhuoran Yang, Animesh Garg, Zhaoran Wang, Lihong Li, Doina Precup

TL;DR
This paper introduces a variance regularization technique for offline RL that reduces over-estimation and distributional shift issues, improving policy learning stability and performance across continuous control tasks.
Contribution
The authors propose a novel variance regularizer using Fenchel duality for offline RL, compatible with existing algorithms, and demonstrate its effectiveness in reducing over-estimation errors.
Findings
The variance regularizer yields a lower bound on the offline policy optimization objective
Improved performance over state-of-the-art offline algorithms
Effective on continuous control tasks
Abstract
Learning policies from fixed offline datasets is a key challenge to scale up reinforcement learning (RL) algorithms towards practical applications. This is often because off-policy RL algorithms suffer from distributional shift, due to mismatch between dataset and the target policy, leading to high variance and over-estimation of value functions. In this work, we propose variance regularization for offline RL algorithms, using stationary distribution corrections. We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer. The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms. We show that the regularizer leads to a lower bound to the offline policy optimization objective, which can help avoid over-estimation errors, and explains…
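The Fenchel-duality step mentioned in the abstract can be illustrated on a plain scalar variance. This is a minimal sketch, not the paper's OVAR objective: it assumes the regularized quantity is a sample array `x` (for instance, density-ratio-weighted rewards), and the function names `variance_direct`, `variance_dual`, and `fit_nu` are hypothetical. The identity used is (E[x])^2 = max over nu of (2*nu*E[x] - nu^2), which gives Var(x) = min over nu of E[(x - nu)^2]. The squared expectation, whose gradient would otherwise need two independent samples, becomes a single expectation that a minibatch estimates without bias.

```python
import numpy as np

def variance_direct(x):
    # Var(x) = E[x^2] - (E[x])^2; the squared expectation is the
    # term that causes double sampling when differentiated.
    return np.mean(x ** 2) - np.mean(x) ** 2

def variance_dual(x, nu):
    # Fenchel form: E[x^2] - (2*nu*E[x] - nu^2) = E[(x - nu)^2].
    # For any nu this upper-bounds the variance; it is tight at nu = E[x].
    return np.mean(x ** 2 - 2.0 * nu * x + nu ** 2)

def fit_nu(x, lr=0.1, steps=200, batch_size=32, seed=0):
    # Stochastic gradient descent on the dual variable nu.
    # Each step uses one minibatch, since the objective is a plain expectation.
    rng = np.random.default_rng(seed)
    nu = 0.0
    for _ in range(steps):
        batch = rng.choice(x, size=batch_size)
        grad = np.mean(2.0 * (nu - batch))  # d/dnu of E[(x - nu)^2]
        nu -= lr * grad
    return nu

x = np.array([1.0, 2.0, 3.0, 4.0])
nu_star = fit_nu(x)
print(variance_direct(x), variance_dual(x, np.mean(x)), nu_star)
```

In a full offline RL setting the same dual variable would be optimized jointly with the policy; how the paper combines it with the stationary distribution corrections is not shown here.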
Taxonomy
Topics: Reinforcement Learning in Robotics · Smart Grid Energy Management · Advanced Bandit Algorithms Research
