Approximate Next Policy Sampling: Replacing Conservative Target Policy Updates in Deep RL
Dillon Sandhu, Ronald Parr

TL;DR
This paper introduces Approximate Next Policy Sampling (ANPS), an approach that modifies the training data distribution so that deep reinforcement learning can take larger policy updates, improving stability and performance.
Contribution
The paper proposes ANPS and Stable Value Approximate Policy Iteration (SV-API), a method that permits larger policy updates safely by adjusting the training data distribution rather than constraining the updates.
Findings
SV-PPO matches or exceeds baseline performance on Atari and continuous control tasks.
ANPS enables larger policy updates while maintaining or improving safety.
The approach guarantees safety under certain stability conditions.
Abstract
We revisit a classic "chicken-and-egg" problem in reinforcement learning: to safely improve a policy, the value function must be accurate on the state-visitation distribution of the updated policy. That distribution over states is unknown and cannot be sampled for the purposes of training the value function. Conservative updates solve this problem, but at the cost of shrinking the policy update. This paper explores an alternative solution, Approximate Next Policy Sampling (ANPS), which addresses the problem by modifying the training distribution rather than constraining the policy update. ANPS is satisfied if the distribution of the training data approximates that of the next policy. To demonstrate the feasibility and efficacy of ANPS, we introduce Stable Value Approximate Policy Iteration (SV-API). SV-API modifies the standard approximate policy iteration loop to hold the target policy…
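The distribution mismatch described in the abstract can be illustrated with a minimal tabular sketch. This is not the paper's method or code: the three-state chain MDP, the two policies, and the helper name `state_visitation` are all hypothetical, chosen only to show that the current policy and an updated policy can visit states in very different proportions, so a value function trained on the former's data may be poorly fit where the latter spends its time.

```python
import numpy as np

def state_visitation(P, policy, mu, gamma):
    """Discounted state-visitation distribution of a policy in a tabular MDP.

    P:      (S, A, S) transition probabilities
    policy: (S, A) action probabilities per state
    mu:     (S,) start-state distribution
    gamma:  discount factor in [0, 1)
    """
    # State-to-state transitions under the policy:
    # P_pi[s, t] = sum_a policy[s, a] * P[s, a, t]
    P_pi = np.einsum("sa,sat->st", policy, P)
    n = P.shape[0]
    # d^T = (1 - gamma) * mu^T (I - gamma * P_pi)^{-1}, solved as a linear system
    return (1.0 - gamma) * np.linalg.solve((np.eye(n) - gamma * P_pi).T, mu)

# Hypothetical 3-state chain: action 0 moves left, action 1 moves right.
n_states, n_actions = 3, 2
P = np.zeros((n_states, n_actions, n_states))
for s in range(n_states):
    P[s, 0, max(s - 1, 0)] = 1.0
    P[s, 1, min(s + 1, n_states - 1)] = 1.0

mu = np.array([1.0, 0.0, 0.0])  # always start in the leftmost state
gamma = 0.9

# Assumed policies: the current one mostly moves left, the updated one mostly right.
current_policy = np.tile([0.9, 0.1], (n_states, 1))
next_policy = np.tile([0.1, 0.9], (n_states, 1))

d_current = state_visitation(P, current_policy, mu, gamma)
d_next = state_visitation(P, next_policy, mu, gamma)

# Total variation distance between the two visitation distributions.
tv_distance = 0.5 * np.abs(d_current - d_next).sum()
print("d_current:", d_current)
print("d_next:   ", d_next)
print("TV distance:", tv_distance)
```

In this toy setting a large policy change shifts most of the visitation mass to the rightmost state, which data gathered under the current policy rarely covers. Conservative updates keep the two distributions close by limiting the change; ANPS, as summarized above, instead asks that the training data approximate the next policy's distribution.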
