SOAP-RL: Sequential Option Advantage Propagation for Reinforcement Learning in POMDP Environments
Shu Ishida, João F. Henriques

TL;DR
This paper introduces SOAP-RL, a novel reinforcement learning algorithm for POMDPs that propagates option advantages through time, improving robustness and option discovery in environments with partial observability.
Contribution
The paper proposes SOAP, a policy gradient method for learning temporally consistent options in POMDPs, and reports that it outperforms PPOEM (a second algorithm proposed in the same paper) as well as the LSTM and Option-Critic baselines.
Findings
SOAP outperforms PPOEM, LSTM, and Option-Critic baselines.
SOAP successfully discovers options in POMDP corridor environments.
SOAP demonstrates robust performance on Atari and MuJoCo benchmarks.
Abstract
This work compares ways of extending Reinforcement Learning algorithms to Partially Observed Markov Decision Processes (POMDPs) with options. One view of options is as temporally extended actions, which can be realized as a memory that allows the agent to retain historical information beyond the policy's context window. While option assignment could be handled using heuristics and hand-crafted objectives, learning temporally consistent options and associated sub-policies without explicit supervision is a challenge. Two algorithms, PPOEM and SOAP, are proposed and studied in depth to address this problem. PPOEM applies the forward-backward algorithm (for Hidden Markov Models) to optimize the expected returns for an option-augmented policy. However, this learning approach is unstable during on-policy rollouts. It is also unsuited for learning causal policies without the knowledge of future…
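For readers unfamiliar with the forward-backward algorithm mentioned in the abstract, here is a minimal sketch of the generic version for a discrete Hidden Markov Model, written with NumPy. It is not the authors' PPOEM implementation. The function name `forward_backward`, the argument names `pi`, `A`, and `B`, and the toy numbers are all assumptions made for illustration, and reading the hidden states as options is only an analogy. The sketch computes the posterior over hidden states at every time step. The backward pass uses observations that come after step t, so these posteriors cannot be computed causally during a rollout.

```python
import numpy as np

def forward_backward(pi, A, B):
    """Posterior state marginals for a discrete HMM (generic sketch).

    pi: (K,)   initial state distribution (hypothetical name)
    A:  (K, K) transitions, A[i, j] = P(z_{t+1} = j | z_t = i)
    B:  (T, K) per-step likelihoods, B[t, k] = P(x_t | z_t = k)
    Returns gamma (T, K) with gamma[t, k] = P(z_t = k | x_1..x_T),
    and the log-likelihood of the whole observation sequence.
    """
    T, K = B.shape
    alpha = np.zeros((T, K))   # scaled forward messages
    beta = np.ones((T, K))     # scaled backward messages
    scale = np.zeros(T)        # per-step normalizers (avoid underflow)

    # Forward pass: filter using past and present observations only.
    alpha[0] = pi * B[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # Backward pass: fold in evidence from future observations.
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1]) / scale[t + 1]

    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)  # guard against round-off
    return gamma, np.log(scale).sum()

# Toy usage: two hidden states, three time steps (made-up numbers).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.2, 0.8]])
gamma, log_lik = forward_backward(pi, A, B)
print(gamma)
print(log_lik)
```

Each row of `gamma` is a distribution over hidden states at one time step. The per-step scaling keeps the messages numerically stable on long sequences and yields the sequence log-likelihood as a by-product.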
Taxonomy
Topics: Smart Grid Energy Management · Age of Information Optimization
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
