SOAP-RL: Sequential Option Advantage Propagation for Reinforcement Learning in POMDP Environments

Shu Ishida; João F. Henriques

arXiv:2407.18913 · cs.LG · October 14, 2024 · 1 cites

TL;DR

This paper introduces SOAP-RL, a novel reinforcement learning algorithm for POMDPs that propagates option advantages through time, improving robustness and option discovery in environments with partial observability.

Contribution

The paper proposes SOAP, a policy gradient method for learning temporally consistent options in POMDPs. SOAP is reported to outperform PPOEM, which is proposed and studied in the same work, as well as the LSTM and Option-Critic baselines.

Findings

01 SOAP outperforms PPOEM, LSTM, and Option-Critic baselines.

02 SOAP successfully discovers options in POMDP corridor environments.

03 SOAP demonstrates robust performance on Atari and MuJoCo benchmarks.

Abstract

This work compares ways of extending Reinforcement Learning algorithms to Partially Observed Markov Decision Processes (POMDPs) with options. One view of options is as temporally extended action, which can be realized as a memory that allows the agent to retain historical information beyond the policy's context window. While option assignment could be handled using heuristics and hand-crafted objectives, learning temporally consistent options and associated sub-policies without explicit supervision is a challenge. Two algorithms, PPOEM and SOAP, are proposed and studied in depth to address this problem. PPOEM applies the forward-backward algorithm (for Hidden Markov Models) to optimize the expected returns for an option-augmented policy. However, this learning approach is unstable during on-policy rollouts. It is also unsuited for learning causal policies without the knowledge of future…
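The abstract notes that PPOEM builds on the forward-backward algorithm for Hidden Markov Models. As a rough illustration only, the sketch below runs textbook forward-backward smoothing on a small discrete HMM whose hidden states stand in for options. It is not the authors' implementation: the function name `forward_backward`, the two-state model, and all probabilities are assumptions chosen for illustration.

```python
from typing import List, Sequence


def forward_backward(
    init: Sequence[float],
    trans: Sequence[Sequence[float]],
    emit: Sequence[Sequence[float]],
    obs: Sequence[int],
) -> List[List[float]]:
    """Return smoothed posteriors P(z_t = i | o_1..o_T) for a discrete HMM.

    init[i]     : probability of starting in hidden state i
    trans[i][j] : probability of moving from state i to state j
    emit[i][o]  : probability of observing symbol o in state i
    obs         : observed symbol indices (hypothetical toy data)
    """
    n, horizon = len(init), len(obs)

    def normalize(values: List[float]) -> List[float]:
        total = sum(values)
        return [v / total for v in values]

    # Forward pass: alpha[t][i] is proportional to P(z_t = i, o_1..o_t).
    alpha = [normalize([init[i] * emit[i][obs[0]] for i in range(n)])]
    for t in range(1, horizon):
        alpha.append(normalize([
            emit[j][obs[t]] * sum(alpha[t - 1][i] * trans[i][j] for i in range(n))
            for j in range(n)
        ]))

    # Backward pass: beta[t][i] is proportional to P(o_{t+1}..o_T | z_t = i).
    beta = [[1.0] * n for _ in range(horizon)]
    for t in range(horizon - 2, -1, -1):
        beta[t] = normalize([
            sum(trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
            for i in range(n)
        ])

    # Combine both passes; per-step rescaling cancels in the final normalization.
    return [
        normalize([alpha[t][i] * beta[t][i] for i in range(n)])
        for t in range(horizon)
    ]


# Toy example: two "sticky" hidden states and noisy binary observations.
posteriors = forward_backward(
    init=[0.5, 0.5],
    trans=[[0.9, 0.1], [0.1, 0.9]],
    emit=[[0.8, 0.2], [0.2, 0.8]],
    obs=[0, 0, 1, 1],
)
for t, row in enumerate(posteriors):
    print(t, [round(p, 3) for p in row])
```

The backward pass conditions each posterior on observations that arrive after step t, so these smoothed estimates are only available once a full sequence has been collected. An agent acting online has access to the forward (filtered) quantities alone.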

Code & Models

Repositories

shuishida/soaprl (PyTorch, official)

Taxonomy

Topics: Smart Grid Energy Management · Age of Information Optimization

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory