Mutual Information Regularized Offline Reinforcement Learning
Xiao Ma, Bingyi Kang, Zhongwen Xu, Min Lin, Shuicheng Yan

TL;DR
This paper introduces MISA, a mutual information regularization framework for offline reinforcement learning that constrains policy improvement to data-supported actions and outperforms a range of baselines on the D4RL benchmark.
Contribution
MISA unifies and extends existing offline RL methods by directly constraining policy updates via mutual information bounds, enhancing stability and performance.
Findings
MISA outperforms baselines on D4RL benchmarks.
Tighter mutual information bounds improve offline RL results.
MISA achieves a total of 742.9 points across the D4RL gym-locomotion tasks.
Abstract
The major challenge of offline RL is the distribution shift that appears when out-of-distribution actions are queried, which makes the policy improvement direction biased by extrapolation errors. Most existing methods address this problem by penalizing the policy or value for deviating from the behavior policy during policy improvement or evaluation. In this work, we propose a novel MISA framework to approach offline RL from the perspective of Mutual Information between States and Actions in the dataset by directly constraining the policy improvement direction. MISA constructs lower bounds of mutual information parameterized by the policy and Q-values. We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset. Hence, we constrain the policy improvement direction to lie in the data manifold. The resulting…
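To make the idea of a mutual information lower bound concrete, the sketch below computes the exact mutual information I(S; A) for a small discrete state-action distribution and compares it with the standard Barber-Agakov variational bound, E[log q(a|s)] + H(A). This is a generic illustration, not MISA's own bound: the abstract says only that MISA's bounds are parameterized by the policy and Q-values, and the toy joint distribution, the function names, and the choice of variational conditional `q` here are all assumptions made for the example.

```python
import numpy as np


def mutual_information(joint):
    """Exact I(S; A) in nats for a discrete joint distribution p(s, a)."""
    joint = np.asarray(joint, dtype=float)
    p_s = joint.sum(axis=1, keepdims=True)  # marginal over states, shape (S, 1)
    p_a = joint.sum(axis=0, keepdims=True)  # marginal over actions, shape (1, A)
    mask = joint > 0                        # skip zero-probability cells
    return float(np.sum(joint[mask] * np.log(joint[mask] / (p_s @ p_a)[mask])))


def barber_agakov_bound(joint, q_a_given_s):
    """Lower bound E_p[log q(a|s)] + H(A) for a variational conditional q(a|s)."""
    joint = np.asarray(joint, dtype=float)
    q = np.asarray(q_a_given_s, dtype=float)
    p_a = joint.sum(axis=0)
    entropy_a = -np.sum(p_a[p_a > 0] * np.log(p_a[p_a > 0]))
    mask = joint > 0
    expected_log_q = np.sum(joint[mask] * np.log(q[mask]))
    return float(expected_log_q + entropy_a)


# Hypothetical toy dataset: two states, two actions, actions correlated with states.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

# The bound is tight when q equals the true conditional p(a|s).
true_conditional = joint / joint.sum(axis=1, keepdims=True)
# A state-independent q ignores the correlation and gives a looser bound.
uniform_q = np.full((2, 2), 0.5)

mi = mutual_information(joint)
tight = barber_agakov_bound(joint, true_conditional)
loose = barber_agakov_bound(joint, uniform_q)

print(f"exact MI: {mi:.4f}  tight bound: {tight:.4f}  loose bound: {loose:.4f}")
```

In this toy case, a variational conditional that matches the data recovers the exact mutual information, while one that ignores the state yields a bound of zero. In a continuous-control setting the expectations would instead be estimated from samples of the offline dataset.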
Taxonomy
Topics: Robotic Locomotion and Control · Reinforcement Learning in Robotics
