Accelerating Proximal Policy Optimization Learning Using Task Prediction for Solving Environments with Delayed Rewards

Ahmad Ahmad; Mehdi Kermanshah; Kevin Leahy; Zachary Serlin; Ho Chit Siu; Makai Mann; Cristian-Ioan Vasile; Roberto Tron; Calin Belta

arXiv:2411.17861 · cs.LG · December 6, 2024

TL;DR

This paper enhances Proximal Policy Optimization for reinforcement learning with delayed rewards by integrating offline expert data and temporal logic-based reward shaping, leading to faster and more effective learning.

Contribution

It introduces a hybrid PPO architecture with reward shaping via TWTL, providing theoretical guarantees and improved empirical performance in delayed reward environments.

Findings

01 Faster learning in the inverted pendulum and lunar lander environments.

02 Improved final performance over standard PPO.

03 Theoretical proof of performance bounds and reward preservation.

Abstract

In this paper, we tackle the challenging problem of delayed rewards in reinforcement learning (RL). While Proximal Policy Optimization (PPO) has emerged as a leading Policy Gradient method, its performance can degrade under delayed rewards. We introduce two key enhancements to PPO: a hybrid policy architecture that combines an offline policy (trained on expert demonstrations) with an online PPO policy, and a reward shaping mechanism using Time Window Temporal Logic (TWTL). The hybrid architecture leverages offline data throughout training while maintaining PPO's theoretical guarantees. Building on the monotonic improvement framework of Trust Region Policy Optimization (TRPO), we prove that our approach ensures improvement over both the offline policy and previous iterations, with a bounded performance gap of $\frac{2 \varsigma \gamma \alpha^{2}}{(1 - \gamma)^{2}}$, where $\alpha$ is the mixing…
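To get a feel for how the bound quoted in the abstract scales, the short Python sketch below simply evaluates the expression (2 ς γ α²) / (1 − γ)². It is an illustration, not code from the paper: the function name `performance_gap_bound` is hypothetical, ς is treated as a given non-negative constant because the excerpt does not define it, γ is assumed to be a discount factor in [0, 1), and α is assumed to lie in [0, 1].

```python
def performance_gap_bound(varsigma: float, gamma: float, alpha: float) -> float:
    """Evaluate (2 * varsigma * gamma * alpha**2) / (1 - gamma)**2.

    Assumptions (not stated in the excerpt): gamma is a discount factor
    in [0, 1), alpha lies in [0, 1], and varsigma is a non-negative constant.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma is assumed to lie in [0, 1)")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    if varsigma < 0.0:
        raise ValueError("varsigma is assumed to be non-negative")
    return (2.0 * varsigma * gamma * alpha ** 2) / (1.0 - gamma) ** 2


if __name__ == "__main__":
    # Illustrative values only; none of these come from the paper.
    for gamma in (0.9, 0.99):
        for alpha in (0.05, 0.1, 0.2):
            bound = performance_gap_bound(varsigma=1.0, gamma=gamma, alpha=alpha)
            print(f"gamma={gamma:<5} alpha={alpha:<5} bound={bound:.4f}")
```

Under these assumptions the expression grows quadratically in α and as 1 / (1 − γ)² in the discount factor, so a small α keeps the gap tight even when γ is close to 1.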

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · IoT and Edge/Fog Computing · Software Reliability and Analysis Research

Methods: Entropy Regularization · Proximal Policy Optimization · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings