ISEP: Implicit Support Expansion for Offline Reinforcement Learning via Stochastic Policy Optimization
Yifei Chen, Shaoqin Zhu, Xiaoqiang Ji

TL;DR
The paper introduces ISEP, an offline reinforcement learning method that implicitly expands the feasible action support by optimizing a stochastic policy against a value function interpolated between in-distribution data and policy samples. This densifies high-reward regions while keeping the value error theoretically bounded.
Contribution
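The paper proposes an implicit support expansion technique for offline RL, paired with a stochastic action selection strategy that mitigates the mode collapse and invalid actions that deterministic averaging produces on the expanded, multimodal support.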
Findings
Implicit support expansion densifies high-reward regions.
Stochastic action selection mitigates mode collapse.
ISEP-FM effectively captures the interpolated value signal.
Abstract
Offline reinforcement learning methods typically enforce strict constraints to ensure safety; yet this rigidity often prevents the discovery of optimal behaviors outside the immediate support of the behavior policy. To address this, we propose Implicit Support Expansion via stochastic Policy optimization (ISEP), which leverages a value function interpolated between in-distribution data and policy samples to implicitly expand the feasible action support. This mechanism "densifies" high-reward regions, creating a navigable path for policy improvement while theoretically guaranteeing bounded value error. However, optimizing against this expanded support creates a multimodal landscape where standard deterministic averaging leads to mode collapse and invalid actions. ISEP mitigates this via a stochastic action selection strategy, optimizing the policy by stochastically alternating between…
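The mode-collapse point in the abstract can be illustrated with a toy example. The sketch below is not the paper's algorithm: the bimodal value function `toy_value`, the two candidate actions, and the peak width are all hypothetical choices made for illustration. It only shows why averaging across two high-value modes can land on a low-value action, while committing stochastically to one candidate stays on a mode.

```python
import math
import random


def toy_value(action: float) -> float:
    """Hypothetical bimodal value landscape with narrow peaks at -1 and +1."""
    left_peak = math.exp(-((action + 1.0) ** 2) / 0.02)
    right_peak = math.exp(-((action - 1.0) ** 2) / 0.02)
    return left_peak + right_peak


# Two candidate target actions, one per mode (assumed for illustration).
candidates = [-1.0, 1.0]

# Deterministic averaging regresses toward the mean of the candidates,
# which falls in the low-value valley between the two peaks.
averaged_action = sum(candidates) / len(candidates)

# Stochastic selection commits to a single candidate per update,
# so the chosen target always sits on one of the peaks.
rng = random.Random(0)
selected_action = rng.choice(candidates)

averaged_value = toy_value(averaged_action)
selected_value = toy_value(selected_action)

print(f"averaged action {averaged_action:+.1f} has value {averaged_value:.3g}")
print(f"selected action {selected_action:+.1f} has value {selected_value:.3g}")
```

In this toy setting the averaged action sits at 0, between the peaks, where the value is close to zero, whereas either selected candidate has a value close to 1. How ISEP itself alternates between targets is not specified in the truncated abstract, so this sketch makes no claim about that procedure.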
